

# Introduction
Data is at the core of any data professional’s work. Without useful and accurate data sources, we cannot fulfill our responsibilities, and poor-quality or irrelevant data only wastes our effort. This is why access to reliable datasets is an important starting point for data professionals.
Data Commons is an open-source initiative by Google to organize the world’s publicly available data and make it accessible for use. Anyone can query it for free. What sets Data Commons apart from other public dataset projects is that it has already done the schema work, so the data is ready to use more quickly.
Given how useful Data Commons is for our work, access to it is becoming important for many data tasks. Fortunately, Data Commons provides a new Python API client to access these datasets.
# Accessing Data Commons with Python
Data Commons works by organizing data into a queryable knowledge graph that combines information from diverse sources. At its core, it uses a schema-based model, built on schema.org, to standardize data representation.
Using this schema, Data Commons can combine data from different sources into a single graph where nodes represent entities (such as cities, places, and people), events, and statistical variables. The edges represent the relationships between these nodes. Each node is unique and identifiable by a DCID (Data Commons ID), and many nodes carry observations: measurements that tie together a variable, an entity, and a time period.
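To make the graph model concrete, here is a minimal sketch in plain Python. It is not how Data Commons stores data internally; the `Graph` and `Observation` classes are hypothetical names invented for illustration, and the observation value is a placeholder rather than real data.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Observation:
    """One measurement: a variable observed for an entity on a date."""
    variable_dcid: str
    entity_dcid: str
    date: str
    value: float


@dataclass
class Graph:
    """Toy knowledge graph: nodes keyed by DCID, edges as triples."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)
    observations: list = field(default_factory=list)

    def add_node(self, dcid, name):
        # A DCID identifies exactly one node, so it works as a dict key.
        self.nodes[dcid] = name

    def add_edge(self, source, prop, target):
        # An edge is a (source DCID, property, target DCID) triple.
        self.edges.append((source, prop, target))

    def neighbors(self, dcid):
        # All (property, target) pairs leaving the given node.
        return [(p, t) for s, p, t in self.edges if s == dcid]


graph = Graph()
graph.add_node("country/IDN", "Indonesia")
graph.add_node("asia", "Asia")  # illustrative DCID for the example
graph.add_edge("country/IDN", "containedInPlace", "asia")

# Placeholder value only, not a real measurement.
graph.observations.append(
    Observation("worldBank/GFDD_AI_25", "country/IDN", "2020", 0.0)
)

print(graph.neighbors("country/IDN"))
```

The point of the sketch is that an observation never stands alone: it always refers back to a variable DCID and an entity DCID, which is why both are needed when querying later.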
With the Python API, we can easily access the knowledge graph to get the data we need. Let’s see how to do this.
First, we need a free API key to access Data Commons. Create a free account and copy the API key to a safe location. You can also use the trial API key, but access is more limited.
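One way to keep the key in a safe location is to store it in an environment variable instead of hard-coding it in a script. The sketch below assumes a variable named `DC_API_KEY`; that name and the `load_api_key` helper are choices made for this example, not requirements of Data Commons.

```python
import os


def load_api_key(env_var="DC_API_KEY"):
    """Read the API key from an environment variable.

    The variable name is an assumption for this example; use any
    name you like, as long as the key stays out of source control.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your API key."
        )
    return key


# Usage (after setting the variable in your shell):
# api_key = load_api_key()
```

Failing early with a clear message is usually easier to debug than an authentication error from a later request.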
Next, install the Data Commons Python library. We will use the V2 API client, as it is the most recent version. Run the following command to install the Data Commons client with optional support for Pandas DataFrames.
pip install "datacommons-client[Pandas]"

With the library installed, we are ready to fetch data using the Data Commons Python client.
To create a client that will access data from the cloud, run the following code.
from datacommons_client.client import DataCommonsClient
client = DataCommonsClient(api_key="YOUR-API-KEY")

One of the most important concepts in Data Commons is the entity, which refers to a persistent, physical thing in the real world, such as a city or country. Entities are an important part of fetching data, as most datasets require you to specify them. You can visit the Data Commons Place page to learn about all available entities.
For most users, the data we want to retrieve is more specific: the statistical variables stored in Data Commons. To select the data we want, we need to know the DCID of the statistical variable, which you can find through the Statistical Variable Explorer.
![]()
You can filter variables and select datasets from the options above. For example, choose the World Bank dataset for “ATMs per 100,000 adults.” You can then get the DCID from the information shown in the Explorer.
![]()
If you click on the DCID, you can see all the information about the node, including how it connects to other information.
![]()
In addition to the statistical variable DCID, we also need the entity DCID for the geography. We can explore the Data Commons Place page mentioned above, or we can use the following code to see the available DCIDs for a particular place name.
# Look up DCIDs by place name (returns multiple candidates)
resp = client.resolve.fetch_dcids_by_name(names="Indonesia").to_dict()
# Collect the DCID of every candidate returned for the first name
dcid_list = [c["dcid"] for c in resp["entities"][0]["candidates"]]
print(dcid_list)

with output like the following:

['country/IDN', 'geoId/...', '...']

Using the above code, we fetch the available DCID candidates for a given place name. For example, among the candidates for “Indonesia”, we can select country/IDN as the country DCID.
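Because a name lookup can return several candidates, it can help to select one programmatically. The following is a minimal sketch that picks the first candidate whose DCID starts with a given prefix. The `pick_dcid` helper is hypothetical, and `sample_response` is a hand-written stand-in that only mirrors the `entities` / `candidates` / `dcid` keys used in the code above, so it runs without an API call.

```python
def pick_dcid(response, prefix):
    """Return the first candidate DCID that starts with `prefix`, or None."""
    for entity in response.get("entities", []):
        for candidate in entity.get("candidates", []):
            dcid = candidate.get("dcid", "")
            if dcid.startswith(prefix):
                return dcid
    return None


# Hand-written stand-in for a resolve response (not real API output).
sample_response = {
    "entities": [
        {
            "candidates": [
                {"dcid": "hypothetical/other"},
                {"dcid": "country/IDN"},
            ]
        }
    ]
}

print(pick_dcid(sample_response, "country/"))  # prints country/IDN
```

Filtering on the `country/` prefix is a simple heuristic for this example; for ambiguous names it is still worth checking the candidates by eye.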
All the information we need is now ready, and all we need to do is execute the following code:
# Both arguments are lists, even when they hold a single DCID
variable = ["worldBank/GFDD_AI_25"]
entity = ["country/IDN"]
df = client.observations_dataframe(
    variable_dcids=variable,
    date="all",
    entity_dcids=entity
)

The result is shown in the dataset below.
![]()
The code above returns all available observations for the selected variable and entity over the entire time frame. You will also notice that we pass lists instead of single strings.
This is because we can pass multiple variables and entities at once to get a combined dataset. For example, the code below fetches two separate statistical variables for two entities at the same time.
# Two statistical variables and two countries in a single request
variable = ["worldBank/GFDD_AI_25", "worldBank/SP_DYN_LE60_FE_IN"]
entity = ["country/IDN", "country/USA"]
df = client.observations_dataframe(
    variable_dcids=variable,
    date="all",
    entity_dcids=entity
)

with output like the following:
![]()
You can see that the resulting data frame combines variables and entities that you defined earlier. With this method, you can get the data you need without having to execute separate queries for each combination.
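A combined result like this is in long form, with one row per observation. For side-by-side comparison it is often handy to reshape it so that each entity and date has one record with a field per variable. The sketch below does this in plain Python on hand-written rows. The keys `entity`, `date`, `variable`, and `value` are assumptions about the column names, the values are placeholders rather than real data, and `to_wide` is a hypothetical helper.

```python
from collections import defaultdict


def to_wide(rows):
    """Group long-form rows into {(entity, date): {variable: value}}."""
    wide = defaultdict(dict)
    for row in rows:
        key = (row["entity"], row["date"])
        wide[key][row["variable"]] = row["value"]
    return dict(wide)


# Placeholder rows with assumed column names (not real observations).
rows = [
    {"entity": "country/IDN", "date": "2020",
     "variable": "worldBank/GFDD_AI_25", "value": 1.0},
    {"entity": "country/IDN", "date": "2020",
     "variable": "worldBank/SP_DYN_LE60_FE_IN", "value": 2.0},
    {"entity": "country/USA", "date": "2020",
     "variable": "worldBank/GFDD_AI_25", "value": 3.0},
]

wide = to_wide(rows)
print(wide[("country/IDN", "2020")])
```

If the real data frame uses these column names, `df.to_dict("records")` would produce rows in this shape; check `df.columns` first, since the names here are assumed.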
That’s all you need to know about accessing Data Commons with the new Python API client. Use this library whenever you need reliable public data for your work.
# Wrapping Up
Data Commons is an open-source project by Google that aims to democratize access to data. It differs from many public data projects in that its datasets are built on top of a knowledge graph schema, which makes the data easier to unify.
In this article, we explored how to access datasets within the graph using Python, using statistical variable and entity DCIDs to retrieve observations.
I hope this has helped!
Cornelius Yudhavijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he likes to share Python and data tips through social media and written media. Cornelius writes on a variety of AI and machine learning topics.