Predicting Isolated & Reclusive Youth in Seoul

NIA Data Analyst Training Program (Dataitgirls, 2023) — Team Project
Award: NIA Director's Award (과학기술정보통신부 산하 한국지능정보사회진흥원 원장상)

Project Summary

South Korea faces a growing crisis of isolated and reclusive youth (고립은둔청년) — an estimated 540,000 young adults (age 19–39) who withdraw from social participation for 6+ months. Depression among youth has doubled in 5 years; youth suicide averages 4.3 per day. Despite increasing government attention, the characteristics that define and predict this population remain vaguely defined, and most support systems rely on self-reporting — meaning those who need help the most are the least likely to ask for it.

This project analyzed the Seoul Metropolitan Government's Isolated/Reclusive Youth Survey to build a predictive model and propose a shift from reactive to proactive intervention.

What We Did

Built classification models on survey data (5,513 respondents × 250 features) to identify isolated youth
Reduced 382 dummy features → 216 (top 90% importance) → 20 key predictive characteristics across 6 categories (demographic, emotional, economic, behavioral, health, relational)
Analyzed 5,830 YouTube comments on isolation-related videos using text classification to find real isolated youth in the wild
Text mining pipeline (KoBERT sentiment analysis, TF-IDF, KoNLPy/Okt preprocessing)

Key Findings

Survey model (objective responses):

Naive Bayes: Accuracy 0.94, F1 0.70, ROC-AUC 0.96 (after correlation-based feature selection)
XGBoost with HyperOpt: Accuracy 0.97, F1 0.98, ROC-AUC 0.99
Top predictive feature by far: isolation duration (고립기간). Followed by: no social interaction, relationship difficulties, mental illness, job loss

YouTube comment analysis (text responses):

From 5,830 comments, the model identified 25 individuals showing strong indicators of isolation
Isolated youth write in formal/polite language (88% used 존댓말 vs 31% in general comments)
Isolated youth write fact-based, neutral sentences (75% vs 36%)
Isolated youth and those close to them express anxiety — every comment classified as isolated-youth had the "anxiety" feature set to True

Multi-dimensional nature confirmed: No single category dominates — effective intervention requires addressing social, economic, psychological, and health dimensions simultaneously.

Models Used

Task	Models	Best Result
Survey classification (objective)	Naive Bayes, XGBoost	XGBoost: F1 0.98, ROC-AUC 0.99
Text classification (subjective)	Logistic Regression, Random Forest	Accuracy ~0.90, but F1 low (0.04–0.06) due to natural class imbalance
Sentiment analysis	KoBERT	Emotion labeling on YouTube comments

Note on text model performance: High accuracy but low F1 was expected — the real-world distribution of isolated youth in YouTube comments is naturally imbalanced. The team intentionally did not apply resampling, as the goal was to find actual isolated individuals in real data, not to optimize metrics.

My Contributions

This was a 7-person team project (Team BlueYouth). This repository contains only my direct contributions.

Area	Contribution
Project Direction	Proposed initial project ideas; structured the topic selection decision-making process
Team Infrastructure	Set up Notion workspace, established ground rules, check-in systems, and GTD-based task management
Data Collection	Sourced and gathered public datasets: Seoul survey data, demographic statistics (KOSIS), spatial data
Variable Selection & Analysis	Designed selection criteria for survey variables (Unit 2: sections A13–B12); documented rationale for each inclusion/exclusion decision → see `preprocessing/variable_selection.md`
Preprocessing Specifications	Wrote preprocessing specifications for assigned variables (B2–B12): defined missing value treatment strategy, conditional logic for survey skip patterns, eligibility validation rules. Implementation was done by the development lead.
Presentation Structure	Created the initial PPT draft, defined the presentation outline (accepted by the team), and led final editing
Presentation Script	Wrote the presentation script; co-led rehearsal with team leader
Team Branding	Designed team name "BlueYouth" (블루유스) logo concept

What I Did NOT Do (Other Team Members' Work)

ML model implementation (Naive Bayes, XGBoost, Random Forest, HyperOpt tuning)
YouTube comment crawling and text preprocessing code
Hyperparameter tuning and model evaluation

Repository Structure

├── README.md
├── data/
│   ├── raw/                          # Public datasets I collected
│   │   ├── 201812_202212_연령별인구현황_연간.csv
│   │   ├── 202310_202310_연령별인구현황_월간.csv
│   │   ├── 연령_및_성별_인구_....csv
│   │   └── 서울시_생활권계획_시설_공간정보.csv
│   └── processed/                    # Final preprocessed outputs (team output)
│       ├── preprocessing_final.xlsx
│       └── preprocessing_final_dummy.xlsx
├── preprocessing/
│   ├── variable_selection.md         # Variable selection criteria & decisions
│   └── preprocessing_guide_unit2.xlsx # Preprocessing specification sheet
└── presentation/
    └── final_presentation_KA_editing.pptx

Data Sources

Dataset	Source	License
Seoul Isolated/Reclusive Youth Survey	Seoul Metropolitan Government	Not redistributed — available via Seoul Open Data
Youth Socioeconomic Survey (2021)	Korea Institute for Health and Social Affairs	Not redistributed — via KIHASA
Population by Age (Annual/Monthly)	KOSIS (Statistics Korea)	Public domain (CC BY)
Spatial Facility Data (Parks)	Seoul Open Data	Public domain
YouTube comments (isolation-related videos, top 6 by comment count)	YouTube	Crawled for research purposes

Note: The core Seoul survey dataset is not included due to redistribution restrictions.

Tech Stack

Python (pandas), Google Colab
Notion (project management)
PowerPoint (presentation — my editing)

Full team stack also included: scikit-learn, XGBoost, HyperOpt, KoNLPy (Okt), KoBERT, matplotlib, wordcloud. These were used by other team members for model implementation.

Context

Program: SW Women Data Analyst Training, Dataitgirls 7th (과학기술정보통신부 / Ministry of Science and ICT → NIA 한국지능정보사회진흥원), 2023
Team: BlueYouth (블루유스) — 7 members
Duration: Oct–Nov 2023 (~4 weeks)
Result: NIA Director's Award (한국지능정보사회진흥원 원장상)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Predicting Isolated & Reclusive Youth in Seoul

Project Summary

What We Did

Key Findings

Models Used

My Contributions

What I Did NOT Do (Other Team Members' Work)

Repository Structure

Data Sources

Tech Stack

Context

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
data		data
preprocessing		preprocessing
presentation		presentation
.gitignore		.gitignore
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Predicting Isolated & Reclusive Youth in Seoul

Project Summary

What We Did

Key Findings

Models Used

My Contributions

What I Did NOT Do (Other Team Members' Work)

Repository Structure

Data Sources

Tech Stack

Context

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages