

Photo by Author | Canva
. Introduction
Finding real-world datasets can be difficult because they are often private (protected), incomplete (missing features), or expensive (behind paywalls). Synthetic datasets solve these issues by letting you generate data that matches your project requirements.
Synthetic data is artificially generated information that imitates real-life datasets. You can control the size, complexity, and realism of a synthetic dataset based on your data requirements.
In this article, we will look at different ways to generate synthetic data. Then we will build a portfolio project by exploring the data, training a machine learning model, and using AI to develop a complete portfolio project with a Streamlit app.
. How to generate synthetic data
Synthetic data is typically created using randomness, rules, simulation, or AI.
!! Method 1: Random data generation
Here, we will use simple functions to create values without any particular rules.
This approach is handy for quick tests, but it won't capture realistic relationships between features. We will do it using NumPy's random module and build a pandas DataFrame.
import numpy as np
import pandas as pd
np.random.seed(42)
df_random = pd.DataFrame({
    "feature_a": np.random.randint(1, 100, 5),
    "feature_b": np.random.rand(5),
    "feature_c": np.random.choice(["X", "Y", "Z"], 5)
})
df_random.head()
Here is the output.


!! Method 2: Rule-based data generation
Rule-based data generation is smarter and more realistic than random generation. It follows an exact formula or set of rules, which makes the output meaningful and consistent.
In our example, the size of a house is directly linked to its price. To show this clearly, we will create a dataset with both size and price and define the relationship with a formula:
price = size × 300 + ε (random noise)
This way, the connection in the data looks reasonably realistic.
np.random.seed(42)
n = 5
size = np.random.randint(500, 3500, n)
price = size * 300 + np.random.randint(5000, 20000, n)
df_rule = pd.DataFrame({
    "size_sqft": size,
    "price_usd": price
})
df_rule.head()
Here is the output.


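To confirm that the rule actually shows up in the generated data, a quick check (a small addition of our own, using the DataFrame above) is the correlation between size and price:
# price is size * 300 plus small noise, so the correlation should be close to 1
print(df_rule["size_sqft"].corr(df_rule["price_usd"]))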
!! Method 3: Simulation-based data generation
Simulation-based data generation combines random variation with real-world rules. This combination produces datasets that behave like real ones.
What do we know about housing?
- Larger homes usually cost more
- Some cities are pricier than others
- There is a baseline price
How do we build the dataset?
- Pick a city at random
- Draw the house size from a distribution
- Set the number of bedrooms between 1 and 5
- Calculate the price with a clear rule
Price rule: we start with a base price, apply the city price bump, and then add size × rate.
price_usd = base_price × city_bump + sqft × rate
Here is the code.
import numpy as np
import pandas as pd
rng = np.random.default_rng(42)
CITIES = ("los_angeles", "san_francisco", "san_diego")
# City price bump: higher means pricier city
CITY_BUMP = {"los_angeles": 1.10, "san_francisco": 1.35, "san_diego": 1.00}
def make_data(n_rows=10):
    city = rng.choice(CITIES, size=n_rows)
    # Most homes are near 1,500 sqft, some smaller or larger
    sqft = rng.normal(1500, 600, n_rows).clip(350, 4500).round()
    beds = rng.integers(1, 6, n_rows)
    base = 220_000
    rate = 350  # dollars per sqft
    bump = np.array([CITY_BUMP[c] for c in city])  # look up each city's price bump
    price = base * bump + sqft * rate
    return pd.DataFrame({
        "city": city,
        "sqft": sqft.astype(int),
        "beds": beds,
        "price_usd": price.round(0).astype(int),
    })
df = make_data()
df.head()
Here is the output.


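As a quick sanity check (our own small addition), you can generate a larger sample and compare average prices by city; the ordering should follow CITY_BUMP:
# With enough rows, san_francisco should come out most expensive and san_diego the cheapest
df_big = make_data(n_rows=1000)
print(df_big.groupby("city")["price_usd"].mean().sort_values(ascending=False))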
!! Method 4: AI-powered data generation
You need a clear prompt to have AI generate your dataset. AI is powerful, but it works best when you set simple, sensible rules.
In the following prompt, we specify:
- Domain: What is the data about?
- Features: Which columns do we want?
- City, neighborhood, square footage, bedrooms, bathrooms
- Relationships: How do the features connect?
- Price depends on the city, sqft, bedrooms, and crime index
- Format: How should the AI return it?
Here is the prompt.
Write code that creates a synthetic California real estate dataset.
The dataset should have 10,000 rows with the columns: city, neighborhood, latitude, longitude, sqft, bedrooms, bathrooms, lot_sqft, year_built, property_type, has_garage, condition, school_score, crime_index, dist_km_center, price_usd.
Cities: Los Angeles, San Francisco, San Diego, San Jose, Sacramento.
The price should depend on the city premium, sqft, bedrooms, bathrooms, lot size, school score, crime index, and distance from the city center.
Add some random noise, missing values, and a few outliers.
Return the result as a pandas DataFrame and save it to ‘ca_housing_synth.csv’.
Let’s use this prompt with ChatGPT.


ChatGPT returned the dataset as a CSV file. Below is the process showing how it created it.


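The exact code ChatGPT produces will differ from run to run, but a simplified sketch of this kind of generation script (our own illustration, not ChatGPT's output, with some requested columns such as neighborhood, property_type, latitude, and longitude omitted for brevity) looks like this:
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

cities = ["Los Angeles", "San Francisco", "San Diego", "San Jose", "Sacramento"]
# Assumed multipliers; the real script chooses its own values
city_premium = {"Los Angeles": 1.20, "San Francisco": 1.45, "San Diego": 1.10,
                "San Jose": 1.30, "Sacramento": 1.00}

city = rng.choice(cities, size=n)
sqft = rng.normal(1600, 650, n).clip(350, 6000).round()
bedrooms = rng.integers(1, 6, n)
bathrooms = rng.integers(1, 4, n)
lot_sqft = rng.normal(6000, 2500, n).clip(1000, 20000).round()
year_built = rng.integers(1900, 2024, n)
school_score = rng.uniform(1, 10, n).round(1)
crime_index = rng.uniform(0, 100, n).round(1)
dist_km_center = rng.exponential(8, n).clip(0.5, 60).round(2)

# Price combines the city premium, home features, context scores, and random noise
premium = np.array([city_premium[c] for c in city])
price = (
    150_000 * premium
    + sqft * 400
    + bedrooms * 10_000
    + bathrooms * 8_000
    + lot_sqft * 5
    + school_score * 12_000
    - crime_index * 1_500
    - dist_km_center * 3_000
    + rng.normal(0, 40_000, n)
).clip(50_000)

df_synth = pd.DataFrame({
    "city": city, "sqft": sqft.astype(int), "bedrooms": bedrooms,
    "bathrooms": bathrooms, "lot_sqft": lot_sqft.astype(int),
    "year_built": year_built, "school_score": school_score,
    "crime_index": crime_index, "dist_km_center": dist_km_center,
    "price_usd": price.round(0).astype(int),
})

# Inject a few missing values, as requested in the prompt
mask = rng.random(n) < 0.02
df_synth.loc[mask, "school_score"] = np.nan

df_synth.to_csv("ca_housing_synth.csv", index=False)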
This is the most complex dataset we have developed so far. Let’s look at the first few rows of this dataset.
. Building a portfolio project from synthetic data
We used four different techniques to create synthetic datasets. We will use the AI-generated data to build the portfolio project.
First, we will explore the data and then build a machine learning model. Next, we will use AI to visualize the results with Streamlit, and in the final stage, we will cover the steps to deploy the model to production.
!! Step 1: Exploring and understanding the synthetic dataset
We will start by reading the data with pandas and displaying the first few rows.
df = pd.read_csv("ca_housing_synth.csv")
df.head()
Here is the output.
The dataset includes location (city, neighborhood, latitude, longitude) and property details (size, rooms, year, condition), as well as the target price. Let's check the column names, data types, and non-null counts using the info() method.
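A quick call (assuming the DataFrame is already loaded as df):
# Column names, dtypes, and non-null counts in one view
df.info()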


We have 15 feature columns plus the target; some, such as has_garage or dist_km_center, are quite specific.
!! Step 2: Model Building
The next step is to build a machine learning model that predicts home prices.
We will follow these steps:
- Define the numeric and categorical columns
- Split the data into training and test sets
- Build preprocessing pipelines (imputation, scaling, one-hot encoding)
- Train a random forest regressor
- Evaluate the model with MAE, RMSE, and R², and plot actual vs. predicted prices
Here is the code.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.inspection import permutation_importance
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# --- Step 1: Define columns based on the generated dataset
num_cols = ["sqft", "bedrooms", "bathrooms", "lot_sqft", "year_built",
            "school_score", "crime_index", "dist_km_center", "latitude", "longitude"]
cat_cols = ["city", "neighborhood", "property_type", "condition", "has_garage"]
# --- Step 2: Split the data
X = df.drop(columns=["price_usd"])
y = df["price_usd"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# --- Step 3: Preprocessing pipelines
num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])
cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])
preprocessor = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols)
])
# --- Step 4: Model
model = RandomForestRegressor(n_estimators=300, random_state=42, n_jobs=-1)
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", model)
])
# --- Step 5: Train
pipeline.fit(X_train, y_train)
# --- Step 6: Evaluate
y_pred = pipeline.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE, computed without the deprecated squared=False argument
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:,.0f}")
print(f"RMSE: {rmse:,.0f}")
print(f"R²: {r2:.3f}")
# --- Step 7: (Optional) Permutation Importance on a subset for speed
pi = permutation_importance(
    pipeline, X_test.iloc[:1000], y_test.iloc[:1000],
    n_repeats=3, random_state=42, scoring="r2"
)
# --- Step 8: Plot Actual vs Predicted
plt.figure(figsize=(6, 5))
plt.scatter(y_test, y_pred, alpha=0.25)
vmin, vmax = min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())
plt.plot((vmin, vmax), (vmin, vmax), linestyle="--", color="red")
plt.xlabel("Actual Price (USD)")
plt.ylabel("Predicted Price (USD)")
plt.title(f"Actual vs Predicted (MAE={mae:,.0f}, RMSE={rmse:,.0f}, R²={r2:.3f})")
plt.tight_layout()
plt.show()
Here is the output.


Model performance:
- MAE (85,877 USD): On average, predictions are off by about $86K, which is reasonable given how much housing prices vary
- RMSE (113,512 USD): RMSE penalizes large errors more heavily; this value suggests the model handles outliers reasonably well
- R² (0.853): The model explains about 85% of the variation in home prices, which indicates strong predictive power on the synthetic data
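The Streamlit prompt in the next step references a saved pipeline file, real_estate_model.pkl, which the training code above does not create. A minimal way to persist the trained pipeline (a small addition of our own, using joblib, which ships alongside scikit-learn) is:
import joblib
# Save the full preprocessing + model pipeline so the Streamlit app can load and reuse it
joblib.dump(pipeline, "real_estate_model.pkl")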
!! Step 3: Visualizing the data
At this stage, we will showcase our process, including the EDA and the model building, in a Streamlit dashboard. Why Streamlit? Because you can quickly and easily build a Streamlit dashboard and deploy it for others to view and interact with.
Using Gemini CLI
We will use the Gemini CLI to create a Streamlit application.
Gemini CLI is an open-source, command-line AI agent. You can write code and build applications with it, and it is straightforward and free to use.
To install it, use the following command in your terminal.
npm install -g @google/gemini-cli
After installing, start it by running the gemini command in your terminal.
It will ask you to log in with your Google account, and then you will see the screen where you will build the Streamlit app.
Dashboard construction
To create the dashboard, we need to write a prompt that reflects our specific data and goal. In the following prompt, we spell out everything the AI needs to build the Streamlit dashboard.
Build a Streamlit app for the California Real Estate dataset by using this dataset ( path-to-dataset )
Here is the dataset information:
• Domain: California housing — Los Angeles, San Francisco, San Diego, San Jose, Sacramento.
• Location: city, neighborhood, lat, lon, and dist_km_center (haversine to city center).
• Home features: sqft, beds, baths, lot_sqft, year_built, property_type, has_garage, condition.
• Context: school_score, crime_index.
• Target: price_usd.
• Price logic: city premium + size + rooms + lot size + school/crime + distance to center + property type + condition + noise.
• Files you have: ca_housing_synth.csv (data) and real_estate_model.pkl (trained pipeline).
The Streamlit app should have:
• A short dataset overview section (shape, column list, small preview).
• Sidebar inputs for every model feature except the target:
- Categorical dropdowns: city, neighborhood, property_type, condition, has_garage.
- Numeric inputs/sliders: lat, lon, sqft, beds, baths, lot_sqft, year_built, school_score, crime_index.
- Auto-compute dist_km_center from the chosen city using the haversine formula and that city’s center.
• A Predict button that:
- Builds a one-row DataFrame with the exact training columns (order-safe).
- Calls pipeline.predict(...) from real_estate_model.pkl.
- Displays Estimated Price (USD) with thousands separators.
• One chart only: What-if: sqft vs price line chart (all other inputs fixed to the sidebar values).
- Quality of life: cache model load, basic input validation, clear labels/tooltips, English UI.
Next, Gemini will ask for your permission to create the file.
Let’s approve and continue. Once the coding is done, it will automatically open the Streamlit dashboard.
If it doesn't, go to the working directory containing the app.py file and run streamlit run app.py to start the Streamlit app.
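Gemini generates the full app for you, but to give a sense of its shape, here is a minimal, illustrative sketch of the prediction core such an app might contain (column names match our dataset; the default values and layout are our own assumptions, not Gemini's exact output):
import joblib
import pandas as pd
import streamlit as st

@st.cache_resource
def load_model(path="real_estate_model.pkl"):
    # Load the trained scikit-learn pipeline once and reuse it across reruns
    return joblib.load(path)

pipe = load_model()
st.title("California Housing Price Estimator")

# A subset of the sidebar inputs; the generated app covers every feature
city = st.sidebar.selectbox("City", ["los_angeles", "san_francisco", "san_diego", "san_jose", "sacramento"])
sqft = st.sidebar.slider("Square feet", 350, 4500, 1500)
bedrooms = st.sidebar.slider("Bedrooms", 1, 5, 3)

if st.button("Predict Price"):
    # Build a one-row DataFrame with the exact training columns;
    # the remaining features get placeholder values in this sketch
    row = pd.DataFrame([{
        "city": city, "neighborhood": "downtown", "latitude": 34.05, "longitude": -118.24,
        "sqft": sqft, "bedrooms": bedrooms, "bathrooms": 2, "lot_sqft": 5000,
        "year_built": 1990, "property_type": "single_family", "has_garage": 1,
        "condition": "good", "school_score": 7.0, "crime_index": 30.0, "dist_km_center": 5.0,
    }])
    price = pipe.predict(row)[0]
    st.metric("Estimated Price (USD)", f"${price:,.0f}")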
Here is our Streamlit dashboard.


Once you click on the dataset overview, you can see the section that presents the data exploration.


Using the options on the left side, we can customize the property and generate predictions accordingly. This part of the dashboard covers what we did in the model-building step, but with a more polished, interactive look.
Let’s select Richmond, San Francisco, single family, excellent condition, 1,500 square feet, and click the “Predict Price” button:
The predicted price is $1.24M. You can also scroll down to see the actual vs. predicted chart for the entire dataset.
You can adjust more features in the left panel, such as the year built, crime index, or number of bathrooms.


!! Step 4: Deploying the model
The next step is to move your model into production. To do this, you can follow these steps:
- Push app.py, the trained pipeline (real_estate_model.pkl), the dataset, and a requirements.txt to a public GitHub repository
- Sign in to Streamlit Community Cloud with your GitHub account
- Create a new app, point it to your repository and app.py, and deploy
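For reference, a minimal requirements.txt for this project might look like the following (package names are the ones used in this article; pin the versions you actually installed):
streamlit
pandas
numpy
scikit-learn
joblib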
. Final thoughts
In this article, we explored different methods of creating synthetic datasets: random, rule-based, simulation-based, and AI-powered. Next, we built a portfolio data project, starting with data exploration and then building a machine learning model.
We also used an open-source, command-line AI agent (Gemini CLI) to generate a dashboard that explores the dataset and predicts home prices based on selected features, including the number of bedrooms, crime index, and square footage.
Creating your own synthetic data lets you move faster, avoid privacy barriers, balance your examples, and skip expensive data collection. The downside is that it can reflect your assumptions and miss real-world quirks. If you are looking for more inspiration, see this list of machine learning projects that you can add to your portfolio.
Finally, we saw how to move your model to production using Streamlit Community Cloud. Go ahead and follow these steps to build and showcase your portfolio project today!
Nate Rosidi is a data scientist and in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.