Case Study Walkthrough: Turning Unstructured Survey Data into Community Health Insights
Project: Community Health Insights Dashboard
Stack: Python (spaCy, Scikit-learn), Azure ML, Power BI
Client: Public sector agency (anonymized for confidentiality)
Goal: Convert open-ended resident survey data and public feedback into actionable insights for policy decision-making.
Step 1: Collecting the Data
Sources:
Challenge: All of these sources were unstructured and full of typos, slang, and overlapping topics (e.g., “rats,” “mold,” and “food storage” all relate to housing conditions).
Step 2: Preprocessing with Python
To clean and prepare the data, I built a pipeline using:
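    import spacy
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Small English model; provides tokenization, stop-word flags, and lemmas.
    nlp = spacy.load("en_core_web_sm")

    def clean_text(text):
        # Lowercase, then keep only the lemmas of alphabetic, non-stop-word tokens.
        doc = nlp(text.lower())
        tokens = [token.lemma_ for token in doc if not token.is_stop and token.is_alpha]
        return " ".join(tokens)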
Step 3: Classifying Themes with NLP
I used a custom multi-label classifier built with Scikit-learn to tag themes like:
Model training included:
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import Pipeline

    # TF-IDF features feed one binary logistic regression per theme.
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(max_df=0.8, min_df=5)),
        ('clf', OneVsRestClassifier(LogisticRegression())),
    ])
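To show how a pipeline of this shape can be trained on multi-label data, here is a minimal sketch on toy inputs. The texts, theme names, `min_df=1`, and `max_iter=1000` are all assumptions made so the example runs on six documents; the use of `MultiLabelBinarizer` to encode the tags is likewise one common approach, not necessarily what the project used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical cleaned responses and theme tags (illustrative only).
texts = [
    "rat kitchen landlord ignore",
    "mold bathroom wall landlord",
    "bus late clinic appointment",
    "clinic far bus route cut",
    "mold rat apartment bus stop far",
    "park light broken night",
]
labels = [
    ["housing"],
    ["housing"],
    ["transport"],
    ["transport"],
    ["housing", "transport"],
    ["safety"],
]

# Turn the tag lists into a binary indicator matrix, one column per theme.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# min_df=1 (not 5) because the toy corpus is tiny.
sketch_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_df=0.8, min_df=1)),
    ("clf", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])
sketch_pipeline.fit(texts, Y)

# Predict an indicator row for a new response and map it back to theme names.
predicted = sketch_pipeline.predict(["mold ceiling rat"])
predicted_themes = mlb.inverse_transform(predicted)
```

Because each theme gets its own binary classifier, a single response can receive zero, one, or several tags, which suits feedback where topics overlap.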
Evaluation metrics:
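The specific metrics are not reproduced here. For multi-label tagging, two commonly used measures are micro-averaged F1 and Hamming loss, and the sketch below shows how they could be computed with Scikit-learn. The label matrices are made-up values chosen for illustration, not project results.

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss

# Hypothetical indicator matrices: rows are responses, columns are themes.
y_true = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
])
y_pred = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
])

# Micro-F1 pools true/false positives and negatives across all themes.
micro_f1 = f1_score(y_true, y_pred, average="micro")

# Hamming loss is the fraction of individual theme decisions that are wrong.
h_loss = hamming_loss(y_true, y_pred)
```

In this toy case 3 of the 5 true tags are recovered with no false positives, so micro-F1 is 0.75, and 2 of the 9 theme decisions are wrong, giving a Hamming loss of about 0.22.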
📊 Step 4: Bringing It to Life in Power BI
The structured output (themes + location + sentiment) was exported into a SQL backend feeding a live Power BI dashboard. The dashboard allowed decision-makers to:
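For the export step, a rough sketch of what the structured output might look like in a SQL table is shown below. The SQL backend is not specified in this write-up, so an in-memory SQLite database stands in for it; the table name `response_themes`, the column names, and the sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical classified rows: (response_id, theme, location, sentiment).
rows = [
    (1, "housing", "District A", "negative"),
    (1, "transport", "District A", "negative"),
    (2, "housing", "District B", "neutral"),
    (3, "safety", "District A", "positive"),
]

conn = sqlite3.connect(":memory:")  # stand-in for the real SQL backend
conn.execute(
    """
    CREATE TABLE response_themes (
        response_id INTEGER NOT NULL,
        theme       TEXT    NOT NULL,
        location    TEXT    NOT NULL,
        sentiment   TEXT    NOT NULL,
        PRIMARY KEY (response_id, theme)
    )
    """
)
conn.executemany("INSERT INTO response_themes VALUES (?, ?, ?, ?)", rows)
conn.commit()

# The kind of aggregate a dashboard visual could read: tag counts per location.
theme_counts = conn.execute(
    """
    SELECT location, theme, COUNT(*) AS n
    FROM response_themes
    GROUP BY location, theme
    ORDER BY location, theme
    """
).fetchall()
```

Storing one row per response and theme keeps multi-label output in a flat shape that reporting tools can filter and group without unpacking lists.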
🔥 Impact
✅ Reduced manual review time by 80%
✅ Enabled real-time insight reporting for 5+ city departments
✅ Informed funding allocation for new health programs and outreach teams