Web Scraping with Python

Duration: 8 min

This module covers web scraping with Python, a useful skill for AI developers who need to gather data from the web. We'll use the requests and BeautifulSoup libraries to fetch pages and to navigate and extract information from HTML and XML documents. Scraped data can feed both machine learning training sets and data analysis.

Understanding HTTP Requests

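Web scraping begins with an HTTP request that retrieves the content of a web page. Python's 'requests' library simplifies this step: it sends the request and wraps the server's reply in a response object. Three parts of that exchange matter most: headers (metadata such as User-Agent and Content-Type), status codes (for example, 200 for success and 404 for not found), and the content type, which tells you whether the body is HTML, JSON, or something else.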

example1.py

import requests

url = 'https://example.com'
# Set a timeout so the call cannot hang indefinitely
response = requests.get(url, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    print('Page retrieved successfully')
    print('Content:', response.text)
else:
    print('Failed to retrieve page', response.status_code)

Expected output:

Page retrieved successfully
Content: (HTML content of the page)
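
Beyond comparing the status code with 200, the headers and content type mentioned above can be handled explicitly. The following is a minimal sketch rather than a definitive implementation: the User-Agent string and the helper names is_html and fetch_html are hypothetical and chosen only for illustration.

```python
import requests

# Hypothetical User-Agent string; replace it with one that identifies your project.
HEADERS = {"User-Agent": "course-scraper-demo/0.1"}


def is_html(response):
    """Return True if the Content-Type header says the body is HTML."""
    content_type = response.headers.get("Content-Type", "")
    # Drop parameters such as "; charset=UTF-8" before comparing.
    return content_type.split(";")[0].strip().lower() == "text/html"


def fetch_html(url, timeout=10):
    """Fetch url and return its HTML text, or None if the body is not HTML."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # Raise requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    if not is_html(response):
        return None
    return response.text
```

Calling fetch_html('https://example.com') should return the page's HTML as a string. Raising an exception on error status codes keeps the success path free of nested if/else checks.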

Parsing HTML with BeautifulSoup

Once we have the HTML content, we parse it with BeautifulSoup. BeautifulSoup builds a parse tree from the markup and provides Pythonic methods for navigating, searching, and modifying that tree, which makes it well suited to extracting data from HTML.

example2.py

from bs4 import BeautifulSoup
import requests

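url = 'https://example.com'
response = requests.get(url, timeout=10)
# Stop early if the server returned an error status (4xx or 5xx)
response.raise_for_status()

# Parse the HTML with Python's built-in parser
soup = BeautifulSoup(response.text, 'html.parser')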

# Extract all paragraph texts
paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p.get_text())
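
Searching the parse tree is not limited to find_all(). As a small sketch that runs without a network connection, the snippet below parses an inline HTML string (invented here for illustration) and uses find() for a single element, select() with a CSS selector, and attribute access to read each link's href.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration.
html = """
<html><body>
  <h1 id="title">Course Catalog</h1>
  <ul class="courses">
    <li><a href="/python">Python Basics</a></li>
    <li><a href="/scraping">Web Scraping</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element (or None if there is none).
title = soup.find("h1").get_text(strip=True)

# select() takes a CSS selector; tag attributes are read like dictionary keys.
links = [(a.get_text(strip=True), a["href"]) for a in soup.select("ul.courses a")]

print(title)
for text, href in links:
    print(text, "->", href)
```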

💡 Tip: Before scraping, check the website's 'robots.txt' file to see which paths crawlers may access, and read the site's terms of service, which are a separate document.
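
The robots.txt check can be automated with the standard library's urllib.robotparser module. The sketch below parses a hypothetical set of rules supplied as a list of lines so that it runs offline; the user-agent name is also an assumption made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, written inline so the example runs offline.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# Hypothetical user-agent name for our scraper.
agent = "course-scraper-demo"

allowed = parser.can_fetch(agent, "https://example.com/index.html")
blocked = parser.can_fetch(agent, "https://example.com/private/data.html")

print("index.html allowed:", allowed)
print("private/data.html allowed:", blocked)
```

Against a live site, you would instead call parser.set_url() with the address of the site's robots.txt file and then parser.read() before asking can_fetch().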

❓ What does the 'status_code' attribute in a response object represent?

The size of the response
The content type of the response
The HTTP status of the response
The URL of the response

❓ Which method in BeautifulSoup is used to find all instances of a particular HTML tag?

find()
select()
find_all()
get_text()
