Data Collection & Sourcing

Types of Data

Structured Data

Tables with defined schema (SQL databases)

Spreadsheets, CSVs

Easy to analyze but limited in scope

Unstructured Data

Text, images, audio, video

Flexible but harder to analyze

Often requires preprocessing

Semi-Structured Data

JSON, XML files

APIs, logs

Balance between flexibility and structure
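To make the "balance" concrete, here is a minimal sketch of turning nested JSON into a table with Pandas. The records are made up for illustration; a real API response may be shaped differently.

```python
import pandas as pd

# Hypothetical API-style records: nested, and not perfectly uniform
records = [
    {"id": 1, "user": {"name": "Ana", "country": "PT"}, "tags": ["new"]},
    {"id": 2, "user": {"name": "Ben"}, "tags": []},
]

# json_normalize flattens nested dicts into dotted column names
df = pd.json_normalize(records)

# Fields missing from a record become NaN instead of raising an error
print(df)
```

The second record has no country, so that cell is filled with a missing value. This is typical of semi-structured data: the shape is mostly regular, but individual fields can be absent.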

Data Sources

Internal Data

Customer databases (CRM, transaction logs)

Product usage (clickstream, features)

Operations (supply chain, HR)

Financial records

External Data

Public datasets (Kaggle, GitHub, UCI ML)

APIs (Google, Twitter, weather services)

Web scraping (with permission)

Partnerships and data brokers

Generating Data

Surveys and questionnaires

A/B tests and experiments

Simulations

Sensor networks and IoT

Popular Data Sources

Free Datasets

Kaggle (10,000+ datasets, competitions)

UCI Machine Learning Repository

Google Dataset Search

GitHub awesome-datasets

APIs

Weather: OpenWeatherMap

Finance: Yahoo Finance, Alpha Vantage

Social: Twitter, Reddit

Maps: Google Maps

Research Data

Academic papers (arXiv)

Government databases (census, economic)

Corporate reports (earnings, market analysis)

Data Collection Methods

Direct Collection

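import pandas as pd
import requests

# From CSV
df = pd.read_csv('data.csv')

# From API (assumes the endpoint returns a JSON list of records)
response = requests.get('https://api.example.com/data', timeout=10)
response.raise_for_status()  # fail early on HTTP errors
data = response.json()
df = pd.DataFrame(data)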

Web Scraping

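import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com', timeout=10)
response.raise_for_status()  # stop on HTTP errors
soup = BeautifulSoup(response.content, 'html.parser')

# Extract data from HTML, for example the text of every link
links = [a.get_text(strip=True) for a in soup.find_all('a')]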
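The extraction step might look like the following sketch. It parses an inline HTML fragment instead of a live page so it runs without network access; the table id and column names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment standing in for response.content
html = """
<table id="prices">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Apple</td><td>1.20</td></tr>
  <tr><td>Pear</td><td>0.95</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="prices")

rows = []
for tr in table.find_all("tr")[1:]:  # skip the header row
    item, price = (td.get_text(strip=True) for td in tr.find_all("td"))
    rows.append({"item": item, "price": float(price)})

print(rows)
```

Real pages are rarely this tidy, so expect to add checks for missing cells and changed layouts.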

Databases

import sqlite3
import pandas as pd

conn = sqlite3.connect('database.db')
query = "SELECT * FROM customers WHERE age > 25"
df = pd.read_sql(query, conn)
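When the filter value comes from a variable or from user input, a parameterized query is usually safer than building the SQL string by hand. The sketch below uses an in-memory SQLite database with a made-up customers table so that it runs anywhere.

```python
import sqlite3
import pandas as pd

# In-memory database with a hypothetical customers table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ana", 31), ("Ben", 22), ("Cy", 40)],
)

# The ? placeholder keeps the value out of the SQL text itself
min_age = 25
df = pd.read_sql_query(
    "SELECT name, age FROM customers WHERE age > ? ORDER BY age",
    conn,
    params=(min_age,),
)
conn.close()

print(df)
```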

Streaming Data

Apache Kafka for real-time data pipelines

Pub/sub systems for event streaming
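The publish/subscribe idea can be illustrated with a toy in-process broker. This is not Kafka or any real messaging library; the Broker class and topic names are invented for illustration, and production systems add persistence, partitioning, and network delivery on top of the same pattern.

```python
from collections import defaultdict


class Broker:
    """Toy broker: each topic maps to a list of subscriber callbacks."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        # Deliver the event to every callback registered for this topic
        for callback in self._subscribers[topic]:
            callback(event)


broker = Broker()
received = []
broker.subscribe("clicks", received.append)

broker.publish("clicks", {"user": 1, "page": "/home"})
broker.publish("orders", {"id": 7})  # no subscribers, so nothing is delivered

print(received)
```

The publisher never needs to know who is listening, which is what lets streaming pipelines add new consumers without changing the producers.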

Data Volume Considerations

Small Data (< 1 GB)

Fits in memory easily

Any laptop/tool works

Quick to experiment

Medium Data (1-100 GB)

May need optimization

Tools: Pandas (chunked reads, efficient dtypes), scikit-learn

Cloud storage recommended
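One common optimization at this scale is reading a file in chunks and aggregating as you go, so the whole dataset never has to sit in memory. The sketch below uses a tiny in-memory CSV with invented columns as a stand-in for a large file.

```python
import io
import pandas as pd

# Small in-memory CSV standing in for a file too large to load at once
csv_data = io.StringIO("region,sales\nnorth,10\nsouth,20\nnorth,5\nsouth,1\n")

totals = {}
# With chunksize set, read_csv yields DataFrames of at most that many rows
for chunk in pd.read_csv(csv_data, chunksize=2):
    partial = chunk.groupby("region")["sales"].sum()
    for region, amount in partial.items():
        totals[region] = totals.get(region, 0) + int(amount)

print(totals)
```

With a real file, pass its path instead of the StringIO object and pick a chunk size that fits comfortably in memory.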

Big Data (> 100 GB)

Requires distributed systems

Tools: Spark, Hadoop, Dask

Professional infrastructure needed

Ethical & Legal Considerations

Data Privacy

GDPR (Europe): Right to privacy, data deletion

CCPA (California): Consumer rights

HIPAA (Healthcare): Protected health information

Ask: "Do I have permission to use this data?"

Bias Awareness

Ensure data represents all groups fairly

Understand historical biases in old data

Document data limitations
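A quick first check on representation is to compare group shares in your sample against a reference distribution. The group labels and reference shares below are hypothetical; in practice the reference might come from census figures or your known customer base.

```python
import pandas as pd

# Hypothetical sample and reference shares
sample = pd.Series(["A", "A", "A", "B"], name="group")
reference = {"A": 0.5, "B": 0.5}

# Share of each group actually present in the sample
observed = sample.value_counts(normalize=True)

# Positive gap = over-represented, negative gap = under-represented
gaps = {
    group: round(float(observed.get(group, 0.0)) - share, 2)
    for group, share in reference.items()
}

print(gaps)
```

A large gap does not prove the data is biased, but it is worth recording among the documented limitations.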

Responsible Collection

Get explicit consent where required

Anonymize personal information

Respect intellectual property

Attribute data sources
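As one possible approach to the anonymization point, direct identifiers can be replaced with keyed hashes. Strictly speaking this is pseudonymization, not full anonymization, and whether it satisfies a particular regulation is a legal question this sketch does not answer. The key and the pseudonymize helper are hypothetical.

```python
import hashlib
import hmac

# Hypothetical key; store it separately from the dataset
SECRET_KEY = b"replace-with-a-secret-key"


def pseudonymize(value: str) -> str:
    """Return a keyed SHA-256 digest that stands in for a direct identifier."""
    normalized = value.strip().lower()
    return hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


token = pseudonymize("Ana@example.com")
print(token)
```

The same input always maps to the same token, so records can still be joined across tables without exposing the original email address.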

Data Quality Issues

| Issue | Example | Solution |
|-------|---------|----------|
| Missing values | NULL in age column | Imputation or removal |
| Duplicates | Same customer twice | Deduplication |
| Outliers | Salary = $10 million | Investigation + handling |
| Wrong format | Age as "twenty-five" | Parsing/conversion |
| Inconsistency | "USA", "United States", "US" | Standardization |
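Several of these fixes can be sketched in a few lines of Pandas. The data is invented, and the choices made here (median imputation, a hand-written country mapping) are assumptions for illustration, not the only reasonable options. Outlier handling is left out because it usually needs case-by-case investigation.

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["Ana", "Ana", "Ben", "Cy", "Dee"],
    "age": ["31", "31", "twenty-five", None, "45"],
    "country": ["USA", "USA", "United States", "US", "USA"],
})

# Duplicates: drop rows that are identical across all columns
df = df.drop_duplicates().copy()

# Wrong format: values that cannot be parsed as numbers become NaN
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Missing values: impute with the median of the known ages
df["age"] = df["age"].fillna(df["age"].median())

# Inconsistency: map known variants to a single label
df["country"] = df["country"].replace({"United States": "USA", "US": "USA"})

print(df)
```

Note that "twenty-five" is treated as missing, not parsed into 25. Converting number words would need a separate step.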

Checklist: Before Using Data

[ ] Do you have permission to use this data?

[ ] Is the data recent enough for your problem?

[ ] Do you understand what each column means?

[ ] Are there obvious quality issues?

[ ] Is the data representative of your target population?

[ ] Are there privacy/ethical concerns?

Key Takeaways

✓ Data comes from internal systems, external sources, or is generated

✓ Choose collection method based on data type and scale

✓ Always consider privacy, ethics, and legal requirements

✓ Assess data quality early: garbage in, garbage out

Practice

Find one dataset you're interested in:

1. Visit Kaggle.com or UCI ML Repository

2. Download a dataset related to your interests

3. Load it with Pandas: pd.read_csv('filename.csv')

4. Use df.info() and df.describe() to explore it

5. Note 3-5 data quality issues you observe

---

Next: Cleaning & preparing data for analysis.