https://www.youtube.com/watch?v=5bgr1ynlsyk
In this project walkthrough, we will find how to clean and analyze the real data of the survey using Star Wars and Pandas, while diving into the fascinating world of star warfare. By working with the results of the Five Treaty Survey, we will expose the viewer’s preferences, film ratings and insights about settlement trends that are clearly ahead.
Survey data analysis is an important skill for any data analyst. Unlike clean, structured datases, the survey response comes with unique challenges: contradictory formatting, mixed data types, checkbox responses that require strategic handling, and lost values that tell their story. This project faces these real -world challenges, and prepares you in your career -to -dirty dirty datases.
During this lesson, we will develop professional quality concepts that tell a compelling story about star warfare, which shows how proper data cleaning and thinking can transform the raw survey data into a stakeholder.
Why does this project make a difference
The survey analysis represents the applicable basic data science skills in industries. Whether you are analyzing customer satisfaction survey, employee engagement data, or market research, the technique shown here is the basis for professional data analysis.
- Data cleaning skills To handle the dirty, real world datasis
- Techniques of Bowlin conversion Survey check box answers.
- The settlement segment analysis To naked the group’s differences
- Professional Image. Design of the night For stakeholder presentations
- The synthesis of insights To translate data results in business intelligence
Star War theme makes learning pleasant, but this skill is directly transmitted to business context. Missing these techniques, and you will be ready to remove meaningful insights from any survey dataset that crosses your desk.
By the end of this tutorial, you will know how:
- Clear Dirt Survey Data by Maping of Yes/No Columns and Changing Checkbox Answers
- Handle unknown columns and make meaningful columns name for analysis
- Use Bowen Mapping Techniques to avoid data abuse when Japrot Cells reopen
- Calculate summary data and rating from the survey’s response
- Create professional -looking horizontal bar charts with custom styling
- Create comparative concepts for settlement analysis
- Apply Object -based metaplatlib for precise control over the appearance of the chart
- Offer clear, viable insights to stakeholders
Before you start: Pre -instruction
To take the maximum of this project, follow these initial steps:
Review the project
Access the project and familiarize yourself with goals and structures: Star Wars Survey Project
Access the solution notebook
You can see it here and download to see what we will cover: Solution notebook
Prepare your environment
- If you are using the Data Quest platform, everything has already been configured for you
- If working locally, make sure you have Pandas, Metaplatlib, and Nimp Install
- Download Datasit from Fywetti Et Gut Hub Repeatory
Provisions
- The main things of Azigar and comfortable with Pandas data frames
- Familiar with dictionaries, loops and methods in azagar
- Basic understanding of metaplatlib (we will use intermediate techniques)
- Survey data structure is helpful, but it is not required
New in Mark Dowan? We recommend learning the basics to format the header and adding context to your Japter notebook: Mark Dowan Guide.
To set your environment
Let’s start by importing the necessary libraries and loading your dataset:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
%matplotlib inline
Command is a magic that offers our plots directly in the notebook. This interactive data is essential for exploration work flu.
star_wars = pd.read_csv("star_wars.csv")
star_wars.head()
Our datasate includes more than 1,100 respondents surveying about their star -wise viewing habits and preferences.
Learning Visual: See unknown columns (unknown: 4, unknown: 5, etc.) and extremely tall column name? This survey is a special thing about the survey data exported from a platform like Maneki. The unknown columns actually represent different films in the franchise, and cleaning them will be our first big task.
Data Challenge: Survey structure defined
Survey data offers a unique structure challenge. Consider this common question of the survey:
“Which of the following star warfare films have you seen? Please select all the application.”
This checkbox -style question is exported as multiple columns where:
- Column 1 contains “Star Wars: Episode I Phantom Menis” if selected, if not, if not
- Column 2 contains “Star Wars: Clones Episode II attack”, if selected, if not
- And so for all six movies …
This structure makes the analysis difficult, so we will turn it into a clean bold columns.
The data cleaning process
Step 1: Bowlin to change yes/no reaction
Survey’s answers often come as text (“yes”/”no”) but bolin values (True
).).False
) Are very easy to work with the program:
yes_no = {"Yes": True, "No": False, True: True, False: False}
for col in (
"Have you seen any of the 6 films in the Star Wars franchise?",
"Do you consider yourself to be a fan of the Star Wars film franchise?",
"Are you familiar with the Expanded Universe?",
"Do you consider yourself to be a fan of the Star Trek franchise?"
):
star_wars(col) = star_wars(col).map(yes_no, na_action='ignore')
Learning Visual: Why seemingly useless? True: True, False: False
Entries? When it comes to re -operating the cells, it prevents overwriting data. Without these entries, if you accidentally run the cell twice, all your True
Will become values NaN
Because the mapping dictionary is no longer included True
As a key. This is a danger of a common gapter that can quietly eliminate your analysis!
Step 2: Changing Movie viewing data
The toughest section involves changing the data of the checkbox movie. Each unknown column represents whether someone has seen a special event of the star war:
movie_mapping = {
"Star Wars: Episode I The Phantom Menace": True,
np.nan: False,
"Star Wars: Episode II Attack of the Clones": True,
"Star Wars: Episode III Revenge of the Sith": True,
"Star Wars: Episode IV A New Hope": True,
"Star Wars: Episode V The Empire Strikes Back": True,
"Star Wars: Episode VI Return of the Jedi": True,
True: True,
False: False
}
for col in star_wars.columns(3:9):
star_wars(col) = star_wars(col).map(movie_mapping)
Step 3: Changing strategic column name
The names of the tall, incredible column make it difficult to analyze. We will rename their name changing things:
star_wars = star_wars.rename(columns={
"Which of the following Star Wars films have you seen? Please select all that apply.": "seen_1",
"Unnamed: 4": "seen_2",
"Unnamed: 5": "seen_3",
"Unnamed: 6": "seen_4",
"Unnamed: 7": "seen_5",
"Unnamed: 8": "seen_6"
})
We will also clear the classification columns:
star_wars = star_wars.rename(columns={
"Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.": "ranking_ep1",
"Unnamed: 10": "ranking_ep2",
"Unnamed: 11": "ranking_ep3",
"Unnamed: 12": "ranking_ep4",
"Unnamed: 13": "ranking_ep5",
"Unnamed: 14": "ranking_ep6"
})
Analysis: naked the data story
Which movie is supreme rule?
Let’s calculate the average rating for each film. Remember, in the classification questions, the low number indicates the higher the priority:
mean_ranking = star_wars(star_wars.columns(9:15)).mean().sort_values()
print(mean_ranking)
ranking_ep5 2.513158
ranking_ep6 3.047847
ranking_ep4 3.272727
ranking_ep1 3.732934
ranking_ep2 4.087321
ranking_ep3 4.341317
The results are decisive: Episode V (Empire Strikesback) emerges as clear fans’ favorite with an average rating of 2.51. The original trinity (IV-VI) significantly improves the Perucel triangle (episodes I-III).
Samples of Movie Viewers
What movies have people actually seen?
total_seen = star_wars(star_wars.columns(3:9)).sum()
print(total_seen)
seen_1 673
seen_2 571
seen_3 550
seen_4 607
seen_5 758
seen_6 738
Episodes guide V and VI in viewers, while especially showing the number of views. Episode III has the lowest audience among the 550 respondents.
Professional Concept: Ready to Stakeholder from Basic
Make our first chart
Let’s start with a basic concept and gradually increase it:
plt.bar(range(6), star_wars(star_wars.columns(3:9)).sum())
It creates a functional chart, but it is not ready for stakeholders. Let’s upgrade to Object -based metaplatlib for precise control:
fig, ax = plt.subplots(figsize=(6,3))
rankings = ax.barh(mean_ranking.index, mean_ranking, color='#fe9b00')
ax.set_facecolor('#fff4d6')
ax.set_title('Average Ranking of Each Movie')
for spine in ('top', 'right', 'bottom', 'left'):
ax.spines(spine).set_visible(False)
ax.invert_yaxis()
ax.text(2.6, 0.35, '*Lowest rank is the most\n liked', fontstyle='italic')
plt.show()
Learning Visual: Think about fig
As your canvas and ax
As a panel or chart area on this canvas. Object -based metaphotleb may seem initially scary, but it provides precise control over every visual element. fig
Objects handle the overall data features ax
Individual chart controls elements.
Advanced concept: gender comparison
Our most sophisticated scenario compares the classification and viewers by gender using side -by -side bars:
# Create gender-based dataframes
males = star_wars(star_wars("Gender") == "Male")
females = star_wars(star_wars("Gender") == "Female")
# Calculate statistics for each gender
male_ranking_avgs = males(males.columns(9:15)).mean()
female_ranking_avgs = females(females.columns(9:15)).mean()
male_tot_seen = males(males.columns(3:9)).sum()
female_tot_seen = females(females.columns(3:9)).sum()
# Create side-by-side comparison
ind = np.arange(6)
height = 0.35
offset = ind + height
fig, ax = plt.subplots(1, 2, figsize=(8,4))
# Rankings comparison
malebar = ax(0).barh(ind, male_ranking_avgs, color='#fe9b00', height=height)
femalebar = ax(0).barh(offset, female_ranking_avgs, color='#c94402', height=height)
ax(0).set_title('Movie Rankings by Gender')
ax(0).set_yticks(ind + height / 2)
ax(0).set_yticklabels(('Episode 1', 'Episode 2', 'Episode 3', 'Episode 4', 'Episode 5', 'Episode 6'))
ax(0).legend(('Men', 'Women'))
# Viewership comparison
male2bar = ax(1).barh(ind, male_tot_seen, color='#ff1947', height=height)
female2bar = ax(1).barh(offset, female_tot_seen, color='#9b052d', height=height)
ax(1).set_title('# of Respondents by Gender')
ax(1).set_xlabel('Number of Respondents')
ax(1).legend(('Men', 'Women'))
plt.show()
Learning Visual: Offset techniques (ind + height
) The key to making the supplement is repeated. This leads to women’s bars slightly down from male bars, which produces a comparative effect. The boundaries of the same axis ensure a fair visual comparison between the chart.
Key results and insights
Through our systematic analysis, we have discovered:
Movie’s preferences:
- Episode v (Empire Straitsback) emerges as the final fans’ favorite in all population items
- The original trinity rating and the audience significantly improve the prevalence
- Episode III receives the lowest rating and has the lowest viewer
Gender analysis:
- Both men and women rank Episode V as their obvious favorite
- Gender differences in priorities are minimal but in favor of permanently masculine engagement
- Men used to rank a bit more event than female
- More men have seen each of the six movies than women, but the samples remain permanent
Distinguished Visual:
- Most films are not equal to the differences between gender
- Epishes V and VI represent the most global appellate content of the VI franchise
- Scientific concepts about gender preferences in science fi show some support in the engagement level, but the taste preferences remain the same.
Summary of the stakeholder
Each analysis should end with clear, viable insights. Stakeholders here need to know:
- Episode V (Empire Strikesback) is the final fans’ favorite With average rating in all settlements
- Gender differences in film preferences are minimalChallenging ordinary stereotypes about science -fi priorities
- The original trinity improves the prevalent Critical welcome and arrive at both the audience
- Male respondents demonstrate high overall engagement With franchise, on average watching more movies
Beyond this analysis: next steps
This datastate is worth looking for extra dimensions:
- Character analysis: Which characters are loved, hated, or are controversial in the fan base?
- “Han shot first” discussion: Analyze this infamous star war dispute and what it appears about
- Cross franchise’s preferences: Detect the concrete between Star Wars and Star Track Fandem
- The concrete of education and age: Do the samples of viewing vary from gender to the distinctive factors?
This project balances the development of technical skills with engaging topics. You will emerge with a polished portfolio piece of data cleaning skills, advanced concepts. The ability to transform the data capabilities, and dirty survey data into viable business insights.
Whether you are a team or Seth, data tells a compelling story. And now you have the skill of telling it beautifully.
If you let this project go, please share your results in this Data Quest Community And tag me (Anna_strahl). I would love to see which samples you discover!
More plans to try
We have some more project walkthrough tutorials you can enjoy: