This guide helps you set up a clean, isolated environment for running the DataQuest tutorials. While many tutorials work fine directly on your computer, some (especially those involving data processing tools like PySpark) may run into problems depending on your operating system or current software setup. The lab environment we’ll create ensures that everything runs consistently, with the right versions of Python and other tools, without affecting your main system.
What is the lab environment?
You can think of this “lab” as a separate workspace just for your DataQuest tutorials. It is a controlled space where you can experiment and test code without affecting your main computer. Just as scientists use labs for experiments, we will use this development lab to safely work through lessons.
Benefits for everyone:
- Windows/Mac users: Avoid errors caused by system differences. No more “command not found” or missing PySpark files
- Linux users: Get the exact Python and Java versions needed for the tutorial, without conflicting with your system packages
- Everyone: Keep your lesson work separate from personal projects. Your code and files are saved as usual, but any packages you install or system changes you make stay inside the lab.
We will use a tool called Docker to create this isolated workspace. Think of it as a dedicated computer for lessons inside your regular computer. Your files and code are accessible as normal (you can edit them with your favorite editor), but the tutorial environment itself stays clean and separate from everything else on your system.
A single lab command creates this environment, and it mirrors real data engineering workflows (most companies use isolated environments like this to ensure consistency across their teams).
Installing Docker
Docker creates an isolated Linux environment on any operating system. This means you’ll have a consistent Linux workspace whether you’re on Windows, Mac, or even Linux. We are using it as a simple tool, so no container orchestration or cloud deployment knowledge is required.
On Windows:
Download Docker Desktop from docker.com/products/docker-desktop. Run the installer, restart your computer when prompted, and open Docker Desktop. It’s running when you see the whale icon in your system tray.
Note: Docker Desktop will automatically enable Windows features such as WSL 2. If you see an error about virtualization, you may need to enable it in your computer’s BIOS settings. Search online for your computer model + “Enable Virtualization” for specific steps.
On Mac:
Download Docker Desktop for your chip type (Intel or Apple Silicon) from the same link. Drag Docker into your Applications folder and launch it. You will see a whale icon in your menu bar.
On Linux:
You probably already have Docker, but if not, run this command in your terminal:
```
curl -fsSL https://get.docker.com -o get-docker.sh && sh get-docker.sh
```

Verify that it works:
Open your terminal (PowerShell, Terminal, or Bash) and run:
```
docker --version
docker compose version
```

You should see a version number for both commands. If you see “command not found”, restart your terminal or computer and try again.
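If you prefer a scripted pre-flight check, a short Python snippet can confirm the docker CLI is on your PATH. This is just a convenience sketch, not part of the official setup, and note that finding the CLI does not confirm the Docker daemon is actually running:

```python
import shutil

# Look up the docker executable on PATH (a pre-flight check only;
# it does not verify that the Docker daemon is running)
docker_path = shutil.which("docker")

if docker_path:
    print(f"docker CLI found at {docker_path}")
else:
    print("docker CLI not found - check your Docker installation")
```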
Getting a lab environment
A lab environment has already been set up in the DataQuest tutorials repository. Clone or download it:
```
git clone <repository-url>
cd tutorials
```

If you don’t have Git, download the repository as a ZIP file from GitHub and extract it.
The collection includes everything you need:
- Dockerfile – creates a Linux environment with Python 3.11 and Java (for Spark)
- docker-compose.yml – describes the lab setup
- Tutorial folders with all code and data
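For reference, the compose file might look roughly like the sketch below. This is an illustration only — the service name lab, the build details, and the mount path are assumptions inferred from the commands used later in this guide, and the actual file in the repository may differ:

```yaml
services:
  lab:
    build: .                # built from the Dockerfile (Python 3.11 + Java)
    volumes:
      - .:/tutorials        # mount the tutorials folder into the container
    working_dir: /tutorials
    stdin_open: true        # keep STDIN open for an interactive shell
    tty: true
```

The volume mount is what lets you edit files on your computer and see the changes instantly inside the lab.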
Starting your own lab
In your IDE’s terminal, make sure you are in the tutorials folder and start the lab:
```
docker compose run --rm lab
```

Note that the first time you run this command, setup may take 2-5 minutes.
You are now in Linux! Your prompt will change to something like root@abc123:/tutorials#, which is your Linux command line where everything will work as expected.
The --rm flag means the lab cleans itself up when you exit, keeping your system tidy.
Using your lab
Once you’re in the lab environment, your typical workflow is:
1. Go to the tutorial you are working on
```
# See all available tutorials
ls

# Enter a specific tutorial
cd pyspark-etl
```

2. Install packages as required
Each tutorial may require different packages:
```
# For PySpark tutorials
pip install pyspark

# For data manipulation tutorials
pip install pandas numpy

# For database connections
pip install sqlalchemy psycopg2-binary
```

3. Run the tutorial code
```
python main.py
```
Since the code will run in a standard Linux environment, you shouldn’t run into setup errors.
4. Edit files normally
The beauty of this setup: you can still use your favorite editor! The tutorials folder on your computer is mounted into the lab. Edit files in VS Code, PyCharm, or any editor, and the lab sees the changes instantly.
5. Exit when done
Type exit or press Ctrl+D to quit the lab. The environment automatically cleans itself up.
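If you ever want to confirm you are actually inside the lab rather than on your host machine, a quick Python check works. This is a minimal sketch; inside the lab it should report Linux and Python 3.11:

```python
import platform
import sys

# Report which OS and Python version this interpreter is running on;
# inside the lab this should show Linux and Python 3.11
print("OS:", platform.system())
print("Python:", sys.version.split()[0])
```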
Examples of typical workflows
Running a PySpark tutorial:
```
docker compose run --rm lab
cd pyspark-etl
pip install pyspark pandas
python main.py
```
Working with Jupyter Notebook:
```
docker compose run --rm -p 8888:8888 lab
pip install jupyterlab
jupyter lab --ip=0.0.0.0 --allow-root --no-browser
# Open the URL it shows in your browser
```
Installing packages between sessions:
If you’re tired of reinstalling packages every session, create a requirements file:
```
# After installing packages, save them
pip freeze > requirements.txt

# Next session, restore them
pip install -r requirements.txt
```
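A hand-written requirements file for the PySpark tutorials might look like this (package names taken from the installs above; the exact list depends on which tutorials you work through):

```
# Illustrative requirements.txt for the PySpark tutorials
pyspark
pandas
numpy
```

Keeping this file in the tutorial folder means it persists between lab sessions, since the folder is shared with your computer.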
Quick reference
The one command you need:
```
# From the tutorials folder
docker compose run --rm lab
```
Exit the lab:
```
exit  # or press Ctrl+D
```
Where things are:
- Tutorial code: in each folder under /tutorials
- Your edits: synced automatically between your computer and the lab
- Data files: in each tutorial’s data/ folder
- Output files: save them to the tutorial folder to view them on your computer
Adding services (database, etc.):
For tutorials that require PostgreSQL, MongoDB, or other services, we can extend docker-compose.yml. For now, the base setup handles all the Python and PySpark tutorials.
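As a sketch of what that extension could look like, the snippet below adds a PostgreSQL service next to the lab service. The service names, credentials, and version tag are all placeholder assumptions, not values from the tutorials repository:

```yaml
services:
  lab:
    build: .
    volumes:
      - .:/tutorials
    working_dir: /tutorials
    depends_on:
      - db                      # start the database alongside the lab
  db:
    image: postgres:16          # placeholder version tag
    environment:
      POSTGRES_USER: tutorial   # placeholder credentials
      POSTGRES_PASSWORD: tutorial
      POSTGRES_DB: tutorial
    ports:
      - "5432:5432"
```

With a setup like this, the lab could reach the database at hostname db on port 5432, since Compose puts both services on the same network.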
Troubleshooting
“Cannot connect to the Docker daemon”
- Docker Desktop needs to be running. Start it from your applications.

“docker-compose” is not recognized
- Use docker compose (with a space), as shown throughout this guide; recent Docker versions ship Compose as a built-in plugin rather than a separate docker-compose command.

Slow performance on Windows
- Docker on Windows can be slow with large datasets. For better performance, store data files inside the container instead of in a mounted folder.

“Permission denied” on Linux
- Add your user to the docker group, then log out and back in:

```
sudo usermod -aG docker $USER
```
You are ready
You now have a Linux lab environment that matches the production system. Happy experimenting!