Data Collection & Sourcing

Duration: 15 min

Types of Data

Structured Data

  • Tables with defined schema (SQL databases)
  • Spreadsheets, CSVs
  • Easy to query and analyze, but limited to what the schema captures

Unstructured Data

  • Text, images, audio, video
  • Flexible but harder to analyze
  • Often requires preprocessing

Semi-Structured Data

  • JSON, XML files
  • APIs, logs
  • Balance between flexibility and structure
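
To make the "balance between flexibility and structure" concrete, here is a minimal sketch of turning nested JSON records into flat, table-like rows. The payload, the field names, and the `flatten` helper are all hypothetical and exist only for illustration; it uses the standard library rather than Pandas.

```python
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical API payload: nested objects, and fields missing in one record.
raw = """
[
  {"id": 1, "user": {"name": "Ana", "country": "US"}, "tags": ["new"]},
  {"id": 2, "user": {"name": "Ben"}}
]
"""

records = [flatten(r) for r in json.loads(raw)]

# Build one consistent column set so every row has the same fields.
columns = sorted({key for r in records for key in r})
rows = [{col: r.get(col) for col in columns} for r in records]

for row in rows:
    print(row)
```

Fields that a record lacks come out as `None`, which is exactly the kind of gap a fixed schema would have forced you to decide about up front.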

Data Sources

Internal Data

  • Customer databases (CRM, transaction logs)
  • Product usage (clickstream, features)
  • Operations (supply chain, HR)
  • Financial records

External Data

  • Public datasets (Kaggle, GitHub, UCI ML)
  • APIs (Google, Twitter, weather services)
  • Web scraping (with permission)
  • Partnerships and data brokers

Generating Data

  • Surveys and questionnaires
  • A/B tests and experiments
  • Simulations
  • Sensor networks and IoT
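
As one illustration of generating data by simulation, the sketch below produces synthetic A/B test records. The conversion rates, the sample size, and the function names are assumptions chosen for the example, not values from a real experiment.

```python
import random


def simulate_ab_test(n_users, rate_a, rate_b, seed=0):
    """Return one synthetic record per user with a random variant and outcome."""
    rng = random.Random(seed)  # seeded so the data set is reproducible
    rows = []
    for user_id in range(n_users):
        variant = rng.choice(["A", "B"])
        rate = rate_a if variant == "A" else rate_b
        converted = rng.random() < rate
        rows.append({"user_id": user_id, "variant": variant, "converted": converted})
    return rows


def conversion_rates(rows):
    """Compute the observed conversion rate for each variant."""
    totals, conversions = {}, {}
    for row in rows:
        v = row["variant"]
        totals[v] = totals.get(v, 0) + 1
        conversions[v] = conversions.get(v, 0) + int(row["converted"])
    return {v: conversions[v] / totals[v] for v in totals}


# Assumed "true" rates: 10% for A, 12% for B.
data = simulate_ab_test(n_users=10_000, rate_a=0.10, rate_b=0.12)
rates = conversion_rates(data)
print(rates)
```

Because the true rates are known here, simulated data like this is handy for checking that an analysis pipeline recovers them before it is pointed at real data.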

Popular Data Sources

Free Datasets

  • Kaggle (10,000+ datasets, competitions)
  • UCI Machine Learning Repository
  • Google Dataset Search
  • GitHub awesome-datasets
    (the widely used list is the awesome-public-datasets repository)

APIs

  • Weather: OpenWeatherMap
  • Finance: Yahoo Finance, Alpha Vantage
  • Social: Twitter, Reddit
  • Maps: Google Maps

Research Data

  • Academic papers (arXiv)
  • Government databases (census, economic)
  • Corporate reports (earnings, market analysis)

Data Collection Methods

Direct Collection

import pandas as pd

From CSV

df = pd.read_csv('data.csv')

From API

import requests

response = requests.get('https://api.example.com/data', timeout=10)
response.raise_for_status()  # stop early on HTTP errors (4xx/5xx)
data = response.json()
df = pd.DataFrame(data)

Web Scraping

from bs4 import BeautifulSoup
import requests

response = requests.get('https://example.com')
soup = BeautifulSoup(response.content, 'html.parser')

Extract data from HTML
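
The extraction step might look like the sketch below, which pulls the rows out of an HTML table. To keep it runnable without a network call or extra installs, it parses an inline HTML string with the standard library's `html.parser` instead of BeautifulSoup. The table contents and the `TableExtractor` class are hypothetical.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Stand-in for response.text from a page you are permitted to scrape.
html = """
<table>
  <tr><th>City</th><th>Temp (C)</th></tr>
  <tr><td>Oslo</td><td>4</td></tr>
  <tr><td>Cairo</td><td>29</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(html)

header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
print(records)
```

Scraped values arrive as strings, so numeric columns such as the temperature still need converting before analysis.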

Databases

import sqlite3

conn = sqlite3.connect('database.db')
query = "SELECT * FROM customers WHERE age > 25"
df = pd.read_sql(query, conn)
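
When a filter value such as the age threshold comes from user input or a variable, a parameterized query is usually safer than building the SQL string by hand. The sketch below uses an in-memory SQLite database with made-up rows so it runs anywhere; the table layout is an assumption.

```python
import sqlite3

# In-memory database with hypothetical sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customers (name, age) VALUES (?, ?)",
    [("Ana", 31), ("Ben", 22), ("Chen", 45)],
)
conn.commit()

# The ? placeholder lets the driver bind the value instead of pasting it into the SQL.
min_age = 25
cursor = conn.execute(
    "SELECT name, age FROM customers WHERE age > ? ORDER BY age", (min_age,)
)
columns = [col[0] for col in cursor.description]
rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
conn.close()

print(rows)
```

With Pandas, the same placeholder style can be passed through `pd.read_sql(query, conn, params=(min_age,))`.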

Streaming Data

  • Apache Kafka for real-time data pipelines
  • Pub/sub systems for event streaming
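
The publish/subscribe idea can be sketched in a few lines. This is a toy, in-process broker, not Kafka or any real messaging client: the `Broker` and `RunningAverage` classes and the topic names are hypothetical, and a real system would add persistence, partitioning, and delivery guarantees.

```python
from collections import defaultdict


class Broker:
    """Toy in-process pub/sub broker: each topic maps to subscriber callbacks."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        # Deliver the event to every subscriber of this topic, in order.
        for callback in self._subscribers[topic]:
            callback(event)


class RunningAverage:
    """Subscriber that updates a mean one event at a time, without storing events."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def __call__(self, event):
        self.count += 1
        self.total += event["value"]

    @property
    def mean(self):
        return self.total / self.count if self.count else None


broker = Broker()
avg_temp = RunningAverage()
received = []
broker.subscribe("sensor.temperature", avg_temp)
broker.subscribe("sensor.temperature", received.append)

for value in [20.0, 22.0, 24.0]:
    broker.publish("sensor.temperature", {"value": value})
broker.publish("sensor.humidity", {"value": 55.0})  # no subscribers, so it is dropped

print(avg_temp.mean, len(received))
```

The point of the pattern is that producers and consumers only share a topic name, so new consumers can be added without touching the code that publishes events.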

Data Volume Considerations

Small Data (< 1 GB)

  • Fits in memory easily
  • Any laptop/tool works
  • Quick to experiment

Medium Data (1-100 GB)

  • May not fit in memory at once; may need chunked reads or compact data types
  • Tools: Pandas (with chunking), scikit-learn
  • Cloud storage recommended

Big Data (> 100 GB)

  • Usually requires distributed or out-of-core processing
  • Tools: Spark, Hadoop, Dask
  • Professional infrastructure needed
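
For data that is too large to load at once, one common optimization is to process the file in chunks and keep only running totals. The sketch below illustrates this with the standard library; the column names, chunk size, and the in-memory stand-in for a large file are assumptions.

```python
import csv
import io
from itertools import islice


def iter_chunks(reader, chunk_size):
    """Yield lists of up to chunk_size rows until the reader is exhausted."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# Stand-in for a large CSV file on disk (hypothetical columns).
csv_text = "order_id,amount\n" + "\n".join(f"{i},{i * 1.5}" for i in range(1, 1001))

total = 0.0
count = 0
with io.StringIO(csv_text) as f:
    reader = csv.DictReader(f)
    for chunk in iter_chunks(reader, chunk_size=100):
        # Only one chunk of rows is held in memory at a time.
        total += sum(float(row["amount"]) for row in chunk)
        count += len(chunk)

mean_amount = total / count
print(count, mean_amount)
```

Pandas offers the same pattern through `pd.read_csv(path, chunksize=...)`, which returns an iterator of DataFrames instead of one large one.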

Ethical & Legal Considerations

Data Privacy

  • GDPR (EU): Right to privacy, data deletion
  • CCPA (California): Consumer rights to know, delete, and opt out of the sale of personal data
  • HIPAA (US healthcare): Protected health information
  • Ask: "Do I have permission to use this data?"

Bias Awareness

  • Ensure data represents all groups fairly
  • Understand historical biases in old data
  • Document data limitations
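
A simple first check for representation is to compare each group's share of the sample with its share of the target population. The sketch below assumes you already have reference shares from a trusted source; the age bands, the shares, and the 10-point threshold are hypothetical.

```python
from collections import Counter


def representation_gaps(sample_groups, reference_shares):
    """Return sample share minus reference share for each reference group."""
    if not sample_groups:
        raise ValueError("sample_groups must not be empty")
    counts = Counter(sample_groups)
    n = len(sample_groups)
    return {
        group: counts.get(group, 0) / n - expected
        for group, expected in reference_shares.items()
    }


# Hypothetical survey sample and population shares by age band.
sample = ["18-34"] * 60 + ["35-54"] * 30 + ["55+"] * 10
reference = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

gaps = representation_gaps(sample, reference)

# Flag groups whose sample share falls more than 10 points below the reference.
underrepresented = [group for group, gap in gaps.items() if gap < -0.10]
print(gaps)
print(underrepresented)
```

A check like this only covers the attributes you thought to measure, so the gaps it reports belong in the documented limitations rather than replacing them.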

Responsible Collection

  • Get explicit consent where required
  • Anonymize personal information
  • Respect intellectual property
  • Attribute data sources
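
One way to reduce exposure of personal information is to replace direct identifiers with keyed hashes before analysis. Note that this is pseudonymization rather than full anonymization: the remaining columns can still identify people in combination. The key, field names, and records below are hypothetical.

```python
import hashlib
import hmac

# Hypothetical key; in practice it would come from a secrets manager, not source code.
SECRET_KEY = b"replace-with-a-secret-key"


def pseudonymize(value, key=SECRET_KEY):
    """Map an identifier to a stable, keyed token using HMAC-SHA256."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()[:16]


customers = [
    {"email": "ana@example.com", "age": 31, "city": "Lisbon"},
    {"email": "Ana@Example.com ", "age": 31, "city": "Lisbon"},
    {"email": "ben@example.com", "age": 22, "city": "Oslo"},
]

# Drop the raw email and keep a token that still allows joins and deduplication.
safe = [
    {"customer_key": pseudonymize(c["email"]), "age": c["age"], "city": c["city"]}
    for c in customers
]

for row in safe:
    print(row)
```

Using a keyed hash rather than a plain one makes it harder for someone without the key to recover emails by hashing a list of guesses.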

Data Quality Issues

| Issue | Example | Solution |
|-------|---------|----------|
| Missing values | NULL in age column | Imputation or removal |
| Duplicates | Same customer twice | Deduplication |
| Outliers | Salary = $10 million | Investigation + handling |
| Wrong format | Age as "twenty-five" | Parsing/conversion |
| Inconsistency | "USA", "United States", "US" | Standardization |
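
The issues in the table can be handled in a single cleaning pass, sketched below on a few made-up records. The lookup tables, the salary threshold, and the choice to flag rather than impute or delete are assumptions for illustration; the right handling depends on the problem.

```python
COUNTRY_ALIASES = {"usa": "US", "united states": "US", "us": "US"}
WORD_NUMBERS = {"twenty-five": 25}  # hypothetical, deliberately tiny lookup


def parse_age(raw):
    """Convert an age given as digits or a known number word; None if unusable."""
    if raw is None or raw == "":
        return None
    text = str(raw).strip().lower()
    if text in WORD_NUMBERS:
        return WORD_NUMBERS[text]
    try:
        return int(text)
    except ValueError:
        return None


def clean(records, salary_cap=1_000_000):
    """Deduplicate by customer_id, fix formats, and flag suspicious salaries."""
    seen = set()
    cleaned = []
    for rec in records:
        if rec["customer_id"] in seen:  # duplicates: keep the first occurrence
            continue
        seen.add(rec["customer_id"])
        country = rec["country"].strip()
        cleaned.append({
            "customer_id": rec["customer_id"],
            "age": parse_age(rec["age"]),  # wrong format and missing values
            "country": COUNTRY_ALIASES.get(country.lower(), country),  # inconsistency
            "salary": rec["salary"],
            "salary_flagged": rec["salary"] > salary_cap,  # outliers: flag, do not drop
        })
    return cleaned


raw = [
    {"customer_id": 1, "age": "twenty-five", "country": "USA", "salary": 52_000},
    {"customer_id": 2, "age": None, "country": "United States", "salary": 10_000_000},
    {"customer_id": 2, "age": None, "country": "United States", "salary": 10_000_000},
    {"customer_id": 3, "age": "41", "country": "US", "salary": 67_000},
]

cleaned = clean(raw)
for row in cleaned:
    print(row)
```

Missing ages are left as `None` here so that the decision between imputation and removal stays explicit and can be made later with more context.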

Checklist: Before Using Data

  • [ ] Do you have permission to use this data?
  • [ ] Is the data recent enough for your problem?
  • [ ] Do you understand what each column means?
  • [ ] Are there obvious quality issues?
  • [ ] Is the data representative of your target population?
  • [ ] Are there privacy/ethical concerns?

Key Takeaways

✓ Data comes from internal systems, external sources, or is generated
✓ Choose collection method based on data type and scale
✓ Always consider privacy, ethics, and legal requirements
✓ Assess data quality early: garbage in, garbage out

Practice

Find one dataset you're interested in:

  1. Visit Kaggle.com or UCI ML Repository
  2. Download a dataset related to your interests
  3. Load it with Pandas: pd.read_csv('filename.csv')
  4. Use df.info() and df.describe() to explore it
  5. Note 3-5 data quality issues you observe

---

Next: Cleaning & preparing data for analysis.