Web Scraping with Python
Duration: 8 min
This module covers web scraping with Python, a practical skill for AI developers who need to gather data from the web. We'll use the requests and BeautifulSoup libraries to fetch pages, then navigate HTML and XML documents and extract information from them. Scraped data of this kind is a common source for machine learning training sets and data analysis.
Understanding HTTP Requests
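Web scraping begins with an HTTP request that retrieves the content of a web page. The third-party 'requests' library (installed with pip install requests) simplifies this process, letting us send requests and inspect responses in a few lines. Three parts of the exchange matter most: headers, which carry metadata such as the client's User-Agent and the response's Content-Type; status codes, which report whether the request succeeded (200 means OK, 404 means the page was not found); and the content type, which tells us whether the body is HTML, JSON, or something else.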
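Headers, status codes, and content types can each be handled in a few lines. The following is a minimal sketch, not a definitive implementation: the function name fetch_html, the User-Agent string 'example-scraper/0.1', and the 10-second timeout are all assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the page body as text, or None if the response is not HTML.

    The name, User-Agent value, and default timeout are illustrative
    assumptions, not part of the requests library.
    """
    # Request header: identify the client to the server (hypothetical value).
    headers = {"User-Agent": "example-scraper/0.1"}

    response = requests.get(url, headers=headers, timeout=timeout)

    # Status code: raise requests.HTTPError for any 4xx or 5xx response.
    response.raise_for_status()

    # Content type: only treat the body as a page if the server says it is HTML.
    content_type = response.headers.get("Content-Type", "")
    if "html" not in content_type.lower():
        return None

    return response.text
```

Calling fetch_html('https://example.com') would return the page's HTML as a string. Network failures and error statuses surface as subclasses of requests.RequestException, so a caller can wrap the call in a single try/except block.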
example1.py
import requests
url = 'https://example.com'
response = requests.get(url)
# Check if the request was successful
if response.status_code == 200:
    print('Page retrieved successfully')
    print('Content:', response.text)
else:
    print('Failed to retrieve page', response.status_code)
Output:
Page retrieved successfully
Content: (HTML content of the page)
Parsing HTML with BeautifulSoup
Once we have the HTML content, we use BeautifulSoup to parse it. BeautifulSoup provides methods and Pythonic idioms that make it easy to navigate, search, and modify the parse tree. It's a powerful tool for extracting data from HTML.
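The three operations named above (navigate, search, modify) can be tried without any network access. The sketch below parses a small inline HTML string; the markup, the 'intro' class, and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used instead of a live page.
html = """
<html><body>
  <h1 id="title">Sample page</h1>
  <p class="intro">First paragraph.</p>
  <p>Second paragraph with a <a href="/about">link</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigate: attribute access returns the first tag with that name.
heading = soup.h1.get_text()

# Search: find() returns the first match, find_all() returns every match.
intro = soup.find("p", class_="intro").get_text()
links = [a["href"] for a in soup.find_all("a")]

# Modify: assigning to .string replaces the tag's text in the parse tree.
soup.h1.string = "Renamed page"
new_heading = soup.h1.get_text()

print(heading)
print(intro)
print(links)
print(new_heading)
```

Working against a fixed string like this is a convenient way to test selectors before pointing them at a real site.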
example2.py
from bs4 import BeautifulSoup
import requests
url = 'https://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Extract all paragraph texts
paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p.get_text())
💡 Tip: Always check the website's 'robots.txt' file before scraping to see which paths the site asks crawlers to avoid, and review its terms of service separately.
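Checking robots.txt can be automated with urllib.robotparser from Python's standard library. The sketch below parses an inline rule set so that it runs offline; the rules, the 'example-scraper' user agent, and the URLs are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real site publishes its own rules.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(useragent, url) reports whether the rules allow that URL.
can_fetch_home = parser.can_fetch(
    "example-scraper", "https://example.com/index.html"
)
can_fetch_private = parser.can_fetch(
    "example-scraper", "https://example.com/private/data.html"
)

print("Home page allowed:", can_fetch_home)
print("Private page allowed:", can_fetch_private)
```

Against a live site, the inline string would be replaced by parser.set_url('https://example.com/robots.txt') followed by parser.read(), which downloads and parses the file.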
❓ What does the 'status_code' attribute in a response object represent?
❓ Which method in BeautifulSoup is used to find all instances of a particular HTML tag?