Data Collection & Sourcing
Duration: 15 min
Types of Data
Structured Data
- Tables with defined schema (SQL databases)
- Spreadsheets, CSVs
- Easy to query and analyze, but limited to what the schema captures
Unstructured Data
- Text, images, audio, video
- Flexible but harder to analyze
- Often requires preprocessing
Semi-Structured Data
- JSON, XML files
- APIs, logs
- Balance between flexibility and structure
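To make the idea of semi-structured data concrete, here is a minimal sketch that flattens a small nested JSON payload into a table with pandas' `json_normalize`. The records and field names are invented for illustration.

```python
import json
import pandas as pd

# Hypothetical payload: nested objects, and the second record lacks "country".
raw = """
[
  {"id": 1, "user": {"name": "Ana", "country": "PT"}, "tags": ["new"]},
  {"id": 2, "user": {"name": "Bo"}, "tags": []}
]
"""

records = json.loads(raw)

# json_normalize expands nested dicts into dotted column names
# (for example "user.name"); fields missing from a record become NaN.
df = pd.json_normalize(records)
print(df)
```

Note that list-valued fields such as `tags` stay as Python lists in a single column, so they usually need a further step before analysis.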
Data Sources
Internal Data
- Customer databases (CRM, transaction logs)
- Product usage (clickstream, features)
- Operations (supply chain, HR)
- Financial records
External Data
- Public datasets (Kaggle, GitHub, UCI ML)
- APIs (Google, Twitter, weather services)
- Web scraping (with permission; check the site's terms of service and robots.txt)
- Partnerships and data brokers
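One small, partial check before scraping is whether a site's robots.txt disallows the path you want. The sketch below uses the standard library's `urllib.robotparser` on an inline, made-up robots.txt so it runs offline; the user agent name and URLs are assumptions. Passing this check is not the same as having permission, since terms of service and copyright still apply.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def may_fetch(url: str, user_agent: str = "my-course-bot") -> bool:
    """Return True if the parsed robots.txt does not disallow this URL."""
    return parser.can_fetch(user_agent, url)


print(may_fetch("https://example.com/products"))
print(may_fetch("https://example.com/private/report"))
```

For a live site you would typically call `set_url(...)` and `read()` on the parser instead of `parse(...)`.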
Generating Data
- Surveys and questionnaires
- A/B tests and experiments
- Simulations
- Sensor networks and IoT
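As an illustration of generating data through simulation, the following sketch produces synthetic A/B test records with the standard library. The conversion rates, group size, and function names are arbitrary assumptions, not values from a real experiment.

```python
import random


def simulate_ab_test(n_per_group, rate_a, rate_b, seed=0):
    """Simulate one row per visitor, with a True/False conversion flag."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    rows = []
    for variant, rate in (("A", rate_a), ("B", rate_b)):
        for _ in range(n_per_group):
            rows.append({"variant": variant, "converted": rng.random() < rate})
    return rows


def conversion_rate(rows, variant):
    """Share of visitors in the given variant who converted."""
    group = [r["converted"] for r in rows if r["variant"] == variant]
    return sum(group) / len(group)


rows = simulate_ab_test(n_per_group=1000, rate_a=0.10, rate_b=0.12)
print(conversion_rate(rows, "A"), conversion_rate(rows, "B"))
```

Simulated data like this is useful for testing an analysis pipeline before real data arrives, but it only reflects the assumptions you put in.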
Popular Data Sources
Free Datasets
- Kaggle (10,000+ datasets, competitions)
- UCI Machine Learning Repository
- Google Dataset Search
- GitHub awesome-datasets
APIs
- Weather: OpenWeatherMap
- Finance: Yahoo Finance, Alpha Vantage
- Social: Twitter, Reddit
- Maps: Google Maps
Research Data
- Academic papers (arXiv)
- Government databases (census, economic)
- Corporate reports (earnings, market analysis)
Data Collection Methods
Direct Collection
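    import pandas as pd
    import requests

    # From CSV
    df = pd.read_csv('data.csv')

    # From API
    response = requests.get('https://api.example.com/data', timeout=10)
    response.raise_for_status()  # stop early on HTTP errors (4xx/5xx)
    data = response.json()
    df = pd.DataFrame(data)  # assumes the API returns a list of flat records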
Web Scraping
    import requests
    from bs4 import BeautifulSoup

    response = requests.get('https://example.com', timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    # Extract data from HTML, e.g. soup.find_all('a') for every link
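The extraction step depends entirely on the page's markup. As a rough sketch, assuming a page that contains a simple HTML table (the markup below is invented and inlined so no request is made), rows can be pulled out like this:

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment standing in for response.content.
html = """
<table id="prices">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="prices")

rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has <th> cells only, so it is skipped
        rows.append({"item": cells[0], "price": float(cells[1])})

print(rows)
```

Real pages are rarely this tidy, so expect to handle missing cells and inconsistent formats.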
Databases
    import sqlite3
    import pandas as pd

    conn = sqlite3.connect('database.db')
    query = "SELECT * FROM customers WHERE age > 25"
    df = pd.read_sql(query, conn)
    conn.close()
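When the filter value comes from a variable or from user input, a parameterized query is generally safer than building the SQL string by hand. The sketch below uses an in-memory SQLite database with made-up rows so it runs without an existing `database.db`; the table contents are assumptions.

```python
import sqlite3
import pandas as pd

# In-memory database with a few hypothetical customers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ana", 31), ("Bo", 22), ("Cy", 45)],
)
conn.commit()

# The "?" placeholder lets the driver bind the value instead of
# interpolating it into the SQL text.
min_age = 25
df = pd.read_sql(
    "SELECT name, age FROM customers WHERE age > ? ORDER BY age",
    conn,
    params=(min_age,),
)
conn.close()
print(df)
```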
Streaming Data
- Apache Kafka for real-time data pipelines
- Pub/sub systems for event streaming
Data Volume Considerations
Small Data (< 1GB)
- Fit in memory easily
- Any laptop/tool works
- Quick to experiment
Medium Data (1-100 GB)
- May need optimization (chunked reads, smaller dtypes, sampling)
- Tools: Pandas, scikit-learn
- Cloud storage recommended
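One common optimization at this scale is to read a file in chunks and aggregate as you go, so the whole dataset never has to sit in memory at once. A minimal sketch, using a tiny in-memory CSV as a stand-in for a large file (the column names and chunk size are assumptions):

```python
import io
import pandas as pd

# Ten-row stand-in for a large file: odd amounts are "north", even are "south".
csv_text = "region,amount\n" + "\n".join(
    f"{'north' if i % 2 else 'south'},{i}" for i in range(1, 11)
)

totals = {}
# chunksize makes read_csv return an iterator of DataFrames.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    partial = chunk.groupby("region")["amount"].sum()
    for region, amount in partial.items():
        totals[region] = totals.get(region, 0) + int(amount)

print(totals)
```

With a real file you would pass its path to `read_csv` and pick a chunk size in the tens or hundreds of thousands of rows.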
Big Data (> 100 GB)
- Requires distributed systems
- Tools: Spark, Hadoop, Dask
- Professional infrastructure needed
Ethical & Legal Considerations
Data Privacy
- GDPR (EU): lawful basis for processing, rights of access and erasure
- CCPA (California): consumer rights to know, delete, and opt out of the sale of personal data
- HIPAA (US healthcare): protected health information
- Ask: "Do I have permission to use this data?"
Bias Awareness
- Ensure data represents all groups fairly
- Understand historical biases in old data
- Document data limitations
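A simple first check for representation is to compare each group's share in your sample against a reference share for the target population. The sketch below does this with the standard library; the age bands, sample, and population shares are all hypothetical numbers chosen for illustration.

```python
from collections import Counter


def representation_gaps(sample_groups, population_shares):
    """Return sample share minus population share for each group."""
    counts = Counter(sample_groups)
    total = sum(counts.values())
    return {
        group: counts.get(group, 0) / total - share
        for group, share in population_shares.items()
    }


# Hypothetical survey sample that skews young.
sample = ["18-34"] * 70 + ["35-54"] * 25 + ["55+"] * 5
population = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

gaps = representation_gaps(sample, population)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.2f}")
```

A large positive or negative gap is a prompt to document the limitation or to reweight or resample, not proof of bias on its own.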
Responsible Collection
- Get explicit consent where required
- Anonymize personal information
- Respect intellectual property
- Attribute data sources
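As one possible starting point for protecting personal information, direct identifiers can be replaced with keyed hashes. The sketch below uses the standard library's `hmac` and `hashlib`; the key, field names, and records are placeholders. Strictly speaking this is pseudonymization rather than full anonymization, because anyone holding the key can re-link records, so the key must be stored separately from the data.

```python
import hashlib
import hmac

# Placeholder key: in practice, load it from a secret store, not source code.
SECRET_KEY = b"replace-with-a-secret-kept-outside-the-dataset"


def pseudonymize(value: str) -> str:
    """Map an identifier to a stable keyed SHA-256 digest."""
    normalized = value.strip().lower()  # so trivial variants map to one key
    return hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical records with an email column.
records = [
    {"email": "Ana@example.com", "spend": 120},
    {"email": "ana@example.com ", "spend": 80},
]

safe = [{"user_key": pseudonymize(r["email"]), "spend": r["spend"]} for r in records]
print(safe)
```

Because the same input always yields the same digest, records for one person can still be joined without exposing the email address itself.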
Data Quality Issues
| Issue | Example | Solution |
|-------|---------|----------|
| Missing values | NULL in age column | Imputation or removal |
| Duplicates | Same customer twice | Deduplication |
| Outliers | Salary = $10 million | Investigation + handling |
| Wrong format | Age as "twenty-five" | Parsing/conversion |
| Inconsistency | "USA", "United States", "US" | Standardization |
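The sketch below walks through one possible pandas treatment of each issue in the table, on a tiny invented DataFrame. The column names, the median imputation, and the 1.5 × IQR outlier rule are illustrative choices, not the only reasonable ones.

```python
import pandas as pd

# Hypothetical customer data exhibiting each issue from the table.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": ["34", "twenty-five", "twenty-five", None, "41"],
    "country": ["USA", "United States", "United States", "US", "Canada"],
    "salary": [52000, 61000, 61000, 10_000_000, 58000],
})

# Duplicates: drop rows that are identical in every column.
df = df.drop_duplicates().reset_index(drop=True)

# Wrong format: values that cannot be parsed as numbers become NaN.
# (Spelled-out numbers like "twenty-five" would need custom parsing to keep.)
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Missing values: impute with the median of the known ages.
df["age"] = df["age"].fillna(df["age"].median())

# Inconsistency: map spelling variants to one canonical label.
df["country"] = df["country"].replace({"United States": "USA", "US": "USA"})

# Outliers: flag (rather than delete) values above Q3 + 1.5 * IQR.
q1, q3 = df["salary"].quantile([0.25, 0.75])
df["salary_outlier"] = df["salary"] > q3 + 1.5 * (q3 - q1)

print(df)
```

Flagging outliers instead of dropping them keeps the "investigation" step possible: a $10 million salary may be an error or a genuine executive record.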
Checklist: Before Using Data
- [ ] Do you have permission to use this data?
- [ ] Is the data recent enough for your problem?
- [ ] Do you understand what each column means?
- [ ] Are there obvious quality issues?
- [ ] Is the data representative of your target population?
- [ ] Are there privacy/ethical concerns?
Key Takeaways
✓ Data comes from internal systems, external sources, or is generated
✓ Choose collection method based on data type and scale
✓ Always consider privacy, ethics, and legal requirements
✓ Assess data quality early: garbage in, garbage out
Practice
Find one dataset you're interested in:
1. Visit Kaggle.com or UCI ML Repository
2. Download a dataset related to your interests
3. Load it with Pandas: pd.read_csv('filename.csv')
4. Use df.info() and df.describe() to explore it
5. Note 3-5 data quality issues you observe
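If it helps with steps 4 and 5, here is a small sketch of a first-pass exploration. It uses a made-up DataFrame in place of your downloaded file, and `quality_summary` is a hypothetical helper, not a pandas function.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, count of missing values, count of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),
    })


# Stand-in for: df = pd.read_csv('filename.csv')
df = pd.DataFrame({
    "city": ["Lima", "Lima", None, "Oslo"],
    "temp_c": [18.5, 18.5, 21.0, None],
})

df.info()              # column types and non-null counts
print(df.describe())   # summary statistics for numeric columns

summary = quality_summary(df)
print(summary)
print("duplicate rows:", int(df.duplicated().sum()))
```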
---
Next: Cleaning & preparing data for analysis.