Whether you are transferring data between APIs or simply producing JSON for import, mismatched schemas can break your workflow. Learning how to clean and normalize JSON data makes for a smooth, error-free data transfer.
This tutorial shows how to clean messy JSON and export the results to a new file based on a predefined schema. The JSON file we will clean contains 200 artificial customer records.
In this tutorial, we will apply two ways to clean the input data:
With pure Python
With pandas
You can apply either of them in your code, but the pandas method is better suited to large, complex datasets. Let's jump straight into the process.
Here’s what we will cover:
Prerequisites
To follow this tutorial, you should have a basic understanding of:
Dictionaries, lists, and loops
JSON data structure (keys, values, and nesting)
How to read and write JSON files with Python's json module
Load and inspect the JSON file
Before writing any code, make sure the .json file you want to clean is in your project directory. This makes it easier to load into your script using just the file name.
Now you can inspect the file by opening it directly or by loading it into your script with the json module.
Here's one way to do that (assuming the file is named "old_customers.json"):
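The tutorial's inspection snippet isn't shown above, so here is a minimal sketch. The first half fabricates a one-record stand-in for old_customers.json so the snippet runs on its own; skip that part if the real file is already in your project directory.

```python
import json

# Stand-in data so this snippet is self-contained -- your real
# old_customers.json already exists and holds 200 records.
sample = {"customers": [{"customer_id": 1, "name": "Ada Lovelace",
                         "email": "ada@example.com", "phone": "555-0100",
                         "membership_level": "gold", "address": "12 Main St"}]}
with open("old_customers.json", "w") as f:
    json.dump(sample, f)

# The actual inspection: load the file and look at its top-level type.
with open("old_customers.json") as f:
    crm_data = json.load(f)

print(type(crm_data))                  # dict or list
print(json.dumps(crm_data, indent=4))  # pretty-print the whole file
```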
This shows you whether the JSON file is structured as a dictionary or a list, and it prints the entire file in your terminal. Mine is a dictionary that maps to 200 customer entries. You should always open a raw JSON file in your IDE to look closely at its structure and schema.
Define the target schema
If someone asks you to clean JSON data, it usually means the current schema is not suitable for its intended purpose. At this point, you want to clarify exactly what the final exported JSON should look like.
A JSON schema is essentially a blueprint that describes:
The desired fields
The field names
The data type of each field
Standard formats (for example, lowercased emails, trimmed whitespace, etc.)
Here is how the old schema compares to the target schema:
As you can see, the goal is to delete the "customer_id" and "address" fields and rename the rest in each entry:
"name" to "full_name"
"email" to "email_address"
"phone" to "mobile"
"membership_level" to "tier"
The output should contain 4 fields per record instead of 6, all renamed to meet the project requirements.
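To make the mapping concrete, here is a single record before and after the change, with made-up values:

```python
# One record under the old schema (values invented for illustration):
old_record = {
    "customer_id": 17,
    "name": "Ada Lovelace",
    "email": "ada@example.com",
    "phone": "555-0100",
    "membership_level": "gold",
    "address": "12 Main St",
}

# The same record under the target schema: 4 renamed fields,
# with "customer_id" and "address" dropped.
new_record = {
    "full_name": old_record["name"],
    "email_address": old_record["email"],
    "mobile": old_record["phone"],
    "tier": old_record["membership_level"],
}

print(new_record)
```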
How to clean JSON data with pure Python
Let's explore the built-in json module to align the raw data with the target schema.
Step 1: Import the json and time modules
Importing json is essential because we are working with JSON files. We will use time to find out how long the data cleaning process takes.
import json
import time
Step 2: Load the file with json.load()
start_time = time.time()
with open('old_customers.json') as file:
crm_data = json.load(file)
Step 3: Write a function to loop through and clean each customer entry
def clean_data(records):
    transformed_records = []
    for customer in records["customers"]:
        transformed_records.append({
            "full_name": customer["name"],
            "email_address": customer["email"],
            "mobile": customer["phone"],
            "tier": customer["membership_level"],
        })
    return {"customers": transformed_records}

new_data = clean_data(crm_data)
clean_data() takes in the original data (temporarily stored in the records variable) and transforms it to match our target schema.
Since our JSON file is a dictionary wrapped around a "customers" key that maps to the list of customer entries, we access that key and loop through each entry in the list.
Inside the for loop, we rename the relevant fields and store the cleaned entries in a new list called "transformed_records".
Then we return a dictionary, keeping the "customers" key intact.
Step 4: Save the output in a .json file
Decide on a name for your cleaned JSON file and assign it to an output_file variable, like:
output_file = "transformed_data.json"
with open(output_file, "w") as f:
json.dump(new_data, f, indent=4)
You can also add a print() statement below this block to confirm that the file has been saved to your project directory.
Step 5: Time the data cleaning process
At the beginning of this process, we imported the time module to measure how long it takes to clean the JSON data using pure Python. To track the run time, we saved the current time in a start_time variable before cleaning, and now we will add an end_time variable at the end of the script.
The difference between the end_time and start_time values gives you the total run time in seconds.
end_time = time.time()
elapsed_time = end_time - start_time
print(f"Transformed data saved to {output_file}")
print(f"Processing data took {elapsed_time:.2f} seconds")
Here is how long the data cleaning process took with the pure Python approach:
How to clean JSON data with pandas
Now we will try to get the same results as above, this time using a third-party library: pandas. Pandas is an open-source library used for data manipulation and analysis.
To start, you need to install the pandas library. In your terminal, run:
pip install pandas
Then follow these steps:
Step 1: Import relevant libraries
import json
import time
import pandas as pd
Step 2: Load file and extract customer entries
Unlike the pure Python approach, where we simply accessed the "customers" key to reach the list of customer data, working with pandas requires a slightly different approach.
We must extract the list before loading it into a DataFrame because pandas expects structured data. Extracting the list of customer dictionaries ensures that we only process and clean the relevant records, preventing errors caused by nested or unrelated JSON data.
start_time = time.time()
with open('old_customers.json', 'r') as f:
crm_data = json.load(f)
clients = crm_data.get("customers", [])
Step 3: Load the customer entries into a DataFrame
Once you have a clean list of customer dictionaries, load the list into a DataFrame and assign it to a variable, such as:
df = pd.DataFrame(clients)
This produces a table-like or spreadsheet-like structure where each row represents a customer. Loading the list into a DataFrame also gives you access to pandas' powerful data cleaning methods, such as:
drop_duplicates(): removes duplicate rows or entries from the DataFrame
dropna(): drops rows with any missing or null values
fillna(value): replaces all missing or null values with a specified value
drop(columns): explicitly drops unwanted columns
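These methods are not strictly needed for our rename-only task, but a quick sketch of two of them on a toy DataFrame (made-up rows) shows the idea:

```python
import pandas as pd

# A toy frame with one duplicate row and one missing phone number.
df_demo = pd.DataFrame([
    {"name": "Ada", "email": "ada@example.com", "phone": "555-0100"},
    {"name": "Ada", "email": "ada@example.com", "phone": "555-0100"},
    {"name": "Bob", "email": "bob@example.com", "phone": None},
])

df_demo = df_demo.drop_duplicates()   # drops the repeated Ada row
df_demo = df_demo.fillna("Unknown")   # fills Bob's missing phone
print(df_demo)
```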
Step 4: Write a custom function to rename the relevant fields
At this point, we need a function that takes in a single customer (a row) and returns a cleaned version that conforms to the target schema ("full_name", "email_address", "mobile", and "tier").
The function should also handle missing data by setting default values such as "Unknown" or "N/A" when a field is absent.
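The tutorial references transform_fields() below without showing its body, so here is one possible implementation consistent with how it is used later (row-wise via apply(), selecting and renaming the four fields with defaults for missing values). Treat it as a sketch rather than the author's exact code:

```python
import pandas as pd

def transform_fields(row):
    """Map one customer row to the target schema, with defaults for gaps."""
    def value_or(key, default):
        val = row.get(key)
        # pd.isna catches both None and NaN cells
        return default if val is None or pd.isna(val) else val

    return {
        "full_name": value_or("name", "Unknown"),
        "email_address": value_or("email", "N/A"),
        "mobile": value_or("phone", "N/A"),
        "tier": value_or("membership_level", "Unknown"),
    }
```

Because it relies only on .get(), the same function works on a plain dictionary or on the pandas Series that apply(axis=1) passes in.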
PS: At first, I used drop(columns) to explicitly remove the "address" and "customer_id" fields. But it is not needed in this case, since the transform_fields() function only selects and renames the desired fields. Any extra columns are automatically excluded from the cleaned data.
Step 5: Apply the schema change to all rows
We will use pandas' apply() method to run our custom function on each row in the DataFrame. This produces a Series (for example, 0 → {…}, 1 → {…}, 2 → {…}), which is not JSON friendly.
Since json.dump() expects a list, not a pandas Series, we will apply tolist() to convert the Series into a list of dictionaries.
transformed_df = df.apply(transform_fields, axis=1)
transformed_data = transformed_df.tolist()
Another way to achieve this is with a list comprehension. Instead of apply(), you can write:
transformed_data = [transform_fields(row) for row in df.to_dict(orient="records")]
The orient="records" argument to df.to_dict() tells pandas to convert the DataFrame into a list of dictionaries, where each dictionary represents a single customer record (i.e., a row). The for loop then iterates through every customer record in that list, calling the custom function on each row. Finally, the list comprehension's brackets ([…]) collect the cleaned rows into a new list.
Step 6: Save the output in a .json file
output_data = {"customers": transformed_data}
output_file = "applypandas_customer.json"
with open(output_file, "w") as f:
json.dump(output_data, f, indent=4)
I suggest choosing a different file name for your pandas output so you can inspect both files and confirm that this output matches the pure Python cleaning results.
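One quick way to compare the two outputs is a small helper that loads both files and checks them for equality. The file names are the ones used earlier in this tutorial, and the helper itself is just a convenience sketch:

```python
import json

def same_output(path_a, path_b):
    """Return True when two JSON files parse to identical data."""
    with open(path_a) as fa, open(path_b) as fb:
        return json.load(fa) == json.load(fb)

# Example (using the file names from this tutorial):
# same_output("transformed_data.json", "applypandas_customer.json")
```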
Step 7: Track the run time
Once again, take the difference between the start and end times to determine the program's execution time.
end_time = time.time()
elapsed_time = end_time - start_time
print(f"Transformed data saved to {output_file}")
print(f"Processing data took {elapsed_time:.2f} seconds")
When I used the list comprehension to apply the custom function, my script's run time was 0.03 seconds, but with pandas' apply() function, the run time dropped to 0.01 seconds.
Final output preview:
If you followed this tutorial closely, your JSON output should look like this, whether you used the pandas method or the pure Python approach:
How to validate the cleaned JSON
Validating your output ensures that the cleaned data follows the expected structure before it is used or integrated. This step helps you quickly catch formatting errors, missing fields, and incorrect data types.
Below are the steps to validate your cleaned JSON file:
Step 1: Install and import jsonschema
jsonschema is a third-party validation library for Python. It lets you define the expected structure of your JSON data and automatically checks whether your output conforms to that structure.
In your terminal, run:
pip install jsonschema
Then import the required names:
import json
from jsonschema import validate, ValidationError
validate() checks whether your JSON data conforms to the rules described in your schema. If the data is valid, nothing happens. But if there is an error, such as a missing field or an incorrect data type, it raises a ValidationError.
Step 2: Define a schema
As you know, a JSON schema changes with each file's structure. If your JSON data differs from what we are working with here, learn how to write schemas here. Otherwise, the schema below describes the structure we expect for our cleaned JSON:
schema = {
    "type": "object",
    "properties": {
        "customers": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "full_name": {"type": "string"},
                    "email_address": {"type": "string"},
                    "mobile": {"type": "string"},
                    "tier": {"type": "string"}
                },
                "required": ["full_name", "email_address", "mobile", "tier"]
            }
        }
    },
    "required": ["customers"]
}
The data is an object that must have a single key: "customers".
"customers" must be an array (a list), with each item representing one customer entry.
Each customer entry must have four fields:
"full_name"
"email_address"
"mobile"
"tier"
The "required" keyword ensures that none of the relevant fields are missing from any customer record.
Step 3: Load the cleaned JSON file
with open("transformed_data.json") as f:
data = json.load(f)
Step 4: Validate the data
In this step, we will use a try...except block so the process finishes safely, displaying a helpful message if the code raises a ValidationError.
try:
validate(instance=data, schema=schema)
print("JSON is valid.")
except ValidationError as e:
print("JSON is invalid:", e.message)
Pandas vs pure Python for data cleaning
From this tutorial, you can probably tell that cleaning and restructuring JSON with pure Python is the more straightforward approach. It is fast and ideal for small datasets or simple transformations.
But as data grows larger and more complex, you may need advanced data cleaning methods that pure Python alone does not provide. In such cases, pandas becomes the better choice. It handles large, complex datasets efficiently, with built-in methods for handling missing data and removing duplicates.
You can study the pandas cheat sheet to learn more data manipulation methods.