April 1, 2025

Hello! Today we’ll begin our final projects. By the end, each group will have made an infographic that tells a story about one of the datasets below.

Introduction

To start with, sit within your groups and discuss which of these datasets you’d like to work with. Each group will work on one dataset (yes, some groups might choose the same one).

When choosing a dataset, think about what interests you in it. Does the topic interest you? Can you do a little research on it to add something to your story? Can you think of interesting visual possibilities (as with our movies data)?

Having chosen a dataset, you will have to:

  • Explore the dataset a little
  • Think of 2-3 questions and use GROUP BY or FILTER summaries to answer them.
  • Turn those summaries into beautiful visual representations. Use a tool of your choice (Datawrapper, Rawgraphs, Google Sheets); the result can even be artistic.
  • Lay out the output as an A2-sized infographic with supporting text, visuals, graphics and your visualizations. Approach it the way you approached your movie visualizations.
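If you are curious what a GROUP BY or FILTER summary does behind the scenes, here is a minimal Python sketch. You do not need code for this project; Orange does the same thing with its widgets. The rows are invented for illustration, and the helper names `filter_rows` and `group_mean` are made up for this example.

```python
from collections import defaultdict
from statistics import mean

# Invented sample rows, for illustration only (not real values from the dataset).
songs = [
    {"playlist_genre": "pop", "danceability": 0.70},
    {"playlist_genre": "pop", "danceability": 0.60},
    {"playlist_genre": "rock", "danceability": 0.40},
    {"playlist_genre": "rock", "danceability": 0.50},
    {"playlist_genre": "rap", "danceability": 0.90},
]

def filter_rows(rows, column, allowed):
    """FILTER: keep only rows whose value in `column` is in `allowed`."""
    return [row for row in rows if row[column] in allowed]

def group_mean(rows, group_col, value_col):
    """GROUP BY: average `value_col` separately for each value of `group_col`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row[value_col])
    return {key: mean(values) for key, values in groups.items()}

# First narrow the data down, then summarize what is left.
pop_and_rock = filter_rows(songs, "playlist_genre", {"pop", "rock"})
summary = group_mean(pop_and_rock, "playlist_genre", "danceability")
print(summary)
```

The result is one number per group, which is exactly the kind of small table a charting tool expects.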

Analysis is a small part of this exercise. You are free to explore interesting questions, but remember that the output is a well-designed poster that tells the story of this dataset and answers some of the questions you raised. Do not be afraid of Orange or of creating summaries; they are part of the process. Use your skills to make something beautiful that is driven by numbers.

Datasets

Questions? Refer to the FAQs.

Spotify Songs

Analyze the danceability, energy or other variables of songs on Spotify. When was Justin Bieber’s quietest phase? Did The Weeknd’s music get sadder over time? The possibilities are endless!

This dataset contains all sorts of juicy attributes about songs - everything from how danceable they are to whether they make you feel happy or sad. Spotify uses machine learning algorithms to generate these attributes, assigning numerical values to qualities we usually think of as subjective.

Possible questions

  • Have songs gotten happier or sadder over time?
  • Which genres have the most danceable tracks?
  • Are more popular songs typically louder?
  • How do different artists’ musical signatures compare? (Like Taylor Swift vs. Drake on energy or acousticness)
  • Can we spot trends in song length across decades or genres?

| Variable | Data Type | Explanation |
|---|---|---|
| track_id | Categorical | A unique code that identifies each song (like a social security number for songs) |
| track_name | Categorical | The title of the song |
| track_artist | Categorical | Who performed the song |
| track_album_id | Categorical | A unique code for the album |
| track_album_name | Categorical | The name of the album the song is on |
| track_album_release_date | Categorical | When the album was released |
| playlist_name | Categorical | The name of the playlist containing the song |
| playlist_id | Categorical | A unique code for the playlist |
| playlist_genre | Categorical | The main music style of the playlist (rock, pop, etc.) |
| playlist_subgenre | Categorical | A more specific music style within the genre |
| danceability | Numeric (0.0-1.0) | How good the song is for dancing (1.0 = perfect for dancing) |
| energy | Numeric (0.0-1.0) | How energetic the song feels (1.0 = very high energy) |
| key | Numeric (0-11) | What musical key the song is in (0 = C, 1 = C♯/D♭, etc.) |
| loudness | Numeric (dB) | How loud the song is (typically between -60 and 0) |
| mode | Numeric (0 or 1) | Whether the song is in a minor (0) or major (1) key |
| speechiness | Numeric (0.0-1.0) | How much spoken word is in the song (high = more talking) |
| acousticness | Numeric (0.0-1.0) | How acoustic the song is (1.0 = definitely acoustic) |
| instrumentalness | Numeric (0.0-1.0) | How likely the song has no vocals (1.0 = instrumental) |
| liveness | Numeric (0.0-1.0) | How likely the song was recorded live (0.8+ = probably live) |
| valence | Numeric (0.0-1.0) | How positive/happy the song sounds (1.0 = very happy) |
| tempo | Numeric (BPM) | How fast the song is in beats per minute |
| duration_ms | Numeric | How long the song is in milliseconds (divide by 1000 for seconds) |
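Two of these columns need a little preparation before they are useful: the release date is stored as text, and the duration is in milliseconds. The sketch below pulls a year out of the date, converts the duration to minutes, and averages valence per year. It assumes the date text always starts with a four-digit year (for example "2015" or "2015-06-01"), which you should check against the real file. The rows are invented, and `release_year` and `duration_min` are names made up for this example.

```python
from collections import defaultdict
from statistics import mean

# Invented sample rows, for illustration only (not real songs).
songs = [
    {"track_album_release_date": "2015-06-01", "valence": 0.8, "duration_ms": 240000},
    {"track_album_release_date": "2015", "valence": 0.6, "duration_ms": 180000},
    {"track_album_release_date": "2019-11-22", "valence": 0.3, "duration_ms": 210000},
]

def release_year(date_text):
    """Assumption: the text starts with a four-digit year."""
    return int(date_text[:4])

valence_per_year = defaultdict(list)
for song in songs:
    # milliseconds -> seconds -> minutes
    song["duration_min"] = song["duration_ms"] / 1000 / 60
    year = release_year(song["track_album_release_date"])
    valence_per_year[year].append(song["valence"])

# One average valence ("happiness") value per release year, oldest first.
valence_by_year = {
    year: mean(values) for year, values in sorted(valence_per_year.items())
}
print(valence_by_year)
```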

Titanic Passengers

This dataset lets us explore who survived and who didn’t during this maritime disaster.

Remember our class example with the mosaic plot? Now you can dive deeper and create your own analysis. There are fascinating possibilities for visuals too.

Possible Questions

  • Did “women and children first” really happen?
  • How much did your ticket class affect your chances of survival?
  • Did traveling with family help or hurt your chances?

| Variable | Data Type | Explanation |
|---|---|---|
| pclass | Categorical (1, 2, 3) | Passenger class - 1st, 2nd, or 3rd class ticket |
| survived | Categorical (0, 1) | Whether the passenger survived (1) or died (0) |
| name | Categorical | Passenger’s full name, often including title (Mr., Mrs., etc.) |
| sex | Categorical | Passenger’s gender (male or female) |
| age | Numeric | Passenger’s age in years (fractional for children under 1) |
| sibsp | Numeric | Number of siblings/spouses the passenger had aboard |
| parch | Numeric | Number of parents/children the passenger had aboard |
| ticket | Categorical | Ticket number |
| fare | Numeric | How much the passenger paid for their ticket (in pounds) |
| cabin | Categorical | Cabin number (many are missing) |
| embarked | Categorical | Port where passenger boarded (C = Cherbourg, Q = Queenstown, S = Southampton) |
| boat | Categorical | Lifeboat number (if survived) |
| body | Numeric | Body identification number (if not survived and body recovered) |
| home.dest | Categorical | Home or destination location |
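Because `survived` is coded as 0 or 1, the survival rate of any group is simply the number of 1s divided by the group size. Here is a minimal sketch that groups by ticket class and sex together. The six passengers are invented for illustration; only the column names come from the table above.

```python
from collections import defaultdict

# Invented sample rows, for illustration only (not real passengers).
passengers = [
    {"pclass": 1, "sex": "female", "survived": 1},
    {"pclass": 1, "sex": "male", "survived": 0},
    {"pclass": 3, "sex": "female", "survived": 1},
    {"pclass": 3, "sex": "female", "survived": 0},
    {"pclass": 3, "sex": "male", "survived": 0},
    {"pclass": 3, "sex": "male", "survived": 0},
]

# Count survivors and group size for every (class, sex) combination.
counts = defaultdict(lambda: {"survived": 0, "total": 0})
for p in passengers:
    group = counts[(p["pclass"], p["sex"])]
    group["survived"] += p["survived"]
    group["total"] += 1

survival_rate = {key: c["survived"] / c["total"] for key, c in counts.items()}

for (pclass, sex), rate in sorted(survival_rate.items()):
    total = counts[(pclass, sex)]["total"]
    print(f"class {pclass}, {sex}: {rate:.0%} of {total} passengers")
```

Printing the group size next to each rate is deliberate: a rate computed from a handful of passengers deserves less weight in your story than one computed from hundreds.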

Global Human Day

This dataset shows how people around the world spend their time each day - capturing what the average day looks like across different countries. It’s a 24-hour snapshot of humanity!

Possible Questions

  • Which countries have the best work-life balance?
  • Where do people spend the most time on leisure?
  • How does childcare time vary across the world?
  • Is there a connection between a country’s wealth and how people spend their time?
  • Which regions get the most sleep?

| Variable | Data Type | Explanation |
|---|---|---|
| Category | Categorical | Main category of time use activities (from M24 classification system) |
| Subcategory | Categorical | More specific subcategory of time use activities (from M24 classification) |
| country_iso3 | Categorical | Three-letter country code that uniquely identifies each country (ISO standard) |
| region_code | Categorical | Code that identifies which geographical region the country belongs to |
| population | Numeric | Total number of people living in the country |
| hoursPerDayCombined | Numeric | Average number of hours per day spent on this activity in this country |
| uncertaintyCombined | Numeric | Statistical measurement of the variance or uncertainty in the time use data |

Please ignore the uncertaintyCombined variable; it is not useful in our case.
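One thing to watch when you summarize by region: a plain average treats a tiny country and a huge one as equals. If you want the figure for the average person in a region rather than the average country, you can weight each country by its population. The sketch below shows one way to do that. The rows are invented, "AAA", "BBB" and "CCC" are placeholder country codes, "R1" and "R2" are placeholder region codes, and "Sleep" is only a guess at what a subcategory label might look like.

```python
from collections import defaultdict

# Invented sample rows, for illustration only. Codes and labels are placeholders.
rows = [
    {"country_iso3": "AAA", "region_code": "R1", "Subcategory": "Sleep",
     "population": 1_000_000, "hoursPerDayCombined": 9.0},
    {"country_iso3": "BBB", "region_code": "R1", "Subcategory": "Sleep",
     "population": 3_000_000, "hoursPerDayCombined": 8.0},
    {"country_iso3": "CCC", "region_code": "R2", "Subcategory": "Sleep",
     "population": 2_000_000, "hoursPerDayCombined": 7.5},
]

hours_times_people = defaultdict(float)  # running sum of hours * population
people = defaultdict(float)              # running sum of population

for row in rows:
    key = (row["region_code"], row["Subcategory"])
    hours_times_people[key] += row["hoursPerDayCombined"] * row["population"]
    people[key] += row["population"]

# Weighted mean = sum(hours * population) / sum(population), per group.
weighted_mean = {key: hours_times_people[key] / people[key] for key in people}
print(weighted_mean)
```

In this made-up example the plain average for R1 would be 8.5 hours, while the population-weighted average is 8.25, because the larger country pulls the figure toward its own value.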

Global Student-Teacher Ratio

The student-teacher ratio is a great proxy for educational investment. Lower ratios (fewer students per teacher) generally mean more resources devoted to education.

This dataset spans multiple years and education levels, letting you see how things have changed over time. Are countries improving their education systems, or are classrooms getting more crowded?

Possible Questions

  • Which countries have the lowest (best) student-teacher ratios?
  • Has the global situation improved or gotten worse over time?
  • How do primary education ratios compare to higher education?
  • Are there regional patterns in educational resourcing?
  • Is there a relationship between a country’s wealth and its education investment?

| Variable | Data Type | Explanation |
|---|---|---|
| edulit_ind | Categorical | Unique identifier code for each education indicator record |
| indicator | Categorical | Education level being measured (e.g., “Primary Education”, “Tertiary Education”) |
| country_code | Categorical | Standard code that uniquely identifies each country |
| country | Categorical | Full name of the country |
| year | Numeric | Year when the data was collected |
| student_ratio | Numeric | Number of students per teacher (lower values mean fewer students per teacher) |
| flag_codes | Categorical | Codes that indicate special conditions or exceptions in the data collection |
| flags | Categorical | Detailed explanation of any exceptions, special circumstances, or data limitations |
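To ask whether things have improved over time, you would typically filter to one education level, drop rows with no ratio, and then summarize per year. The sketch below uses the median rather than the mean so that a few extreme countries do not dominate. The rows are invented, the country names are placeholders, and the exact indicator labels in the real file may differ from the "Primary Education" used here. It also assumes that a missing ratio shows up as an empty value (`None`).

```python
from collections import defaultdict
from statistics import median

# Invented sample rows, for illustration only (not real figures).
rows = [
    {"indicator": "Primary Education", "country": "Country A", "year": 2012, "student_ratio": 30.0},
    {"indicator": "Primary Education", "country": "Country B", "year": 2012, "student_ratio": 20.0},
    {"indicator": "Primary Education", "country": "Country C", "year": 2012, "student_ratio": None},
    {"indicator": "Primary Education", "country": "Country A", "year": 2016, "student_ratio": 26.0},
    {"indicator": "Primary Education", "country": "Country B", "year": 2016, "student_ratio": 18.0},
    {"indicator": "Tertiary Education", "country": "Country A", "year": 2016, "student_ratio": 15.0},
]

# FILTER: one education level only, and skip rows with a missing ratio.
primary = [
    r for r in rows
    if r["indicator"] == "Primary Education" and r["student_ratio"] is not None
]

# GROUP BY year, then take the median ratio within each year.
ratios_per_year = defaultdict(list)
for r in primary:
    ratios_per_year[r["year"]].append(r["student_ratio"])

median_by_year = {
    year: median(values) for year, values in sorted(ratios_per_year.items())
}
print(median_by_year)
```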

Proposed Outline

Tuesday

Morning: Question Formulation & Dataset Selection

  • Explore the datasets and select the one that interests you most
  • Define clear research questions for your chosen dataset

Afternoon: Initial Data Exploration

  • Load dataset into Orange
  • Examine data structure using Feature Statistics
  • Create first exploratory visualizations to identify patterns
  • Begin sketching poster concepts based on initial findings

Wednesday

Morning: Data Summarization & Design

  • Use Group By to create meaningful summaries
  • Export your summarized dataset
  • Start designing your poster layout
  • Select color schemes and typography
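The exported summary from the Group By step is usually just a small CSV table with one row per group. As a rough sketch of what that file contains, here is how the same table could be written in Python. The numbers are invented, and `summary_to_csv` and the `mean_danceability` column header are names made up for this example.

```python
import csv
import io

# Hypothetical summary with invented numbers: one value per group.
summary = {"pop": 0.65, "rock": 0.45}

def summary_to_csv(summary, group_header, value_header):
    """Turn a {group: value} summary into CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([group_header, value_header])
    for group, value in sorted(summary.items()):
        writer.writerow([group, round(value, 2)])
    return buffer.getvalue()

csv_text = summary_to_csv(summary, "playlist_genre", "mean_danceability")
print(csv_text)

# To save it as a file instead of printing it:
# with open("summary.csv", "w", newline="") as f:
#     f.write(csv_text)
```

A clear header row matters more than it seems, since charting tools generally pick up column names from it when you upload or paste the data.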

Afternoon: Visualization Creation

  • Build your visualizations in Datawrapper/Rawgraphs
  • Incorporate visualizations into your poster design
  • Add preliminary titles, annotations, and context

Thursday

Morning: Design Refinement

  • Polish your visualizations
  • Improve titles and annotations
  • Ensure your poster tells a clear story

Afternoon: Final Production & Presentation

  • Complete final design adjustments
  • Produce final poster