Module 1 of 21 · Computer Vision · Intermediate

Introduction to Computer Vision

Duration: 5 min

This module provides an introduction to the field of computer vision, focusing on key concepts and techniques such as Convolutional Neural Networks (CNNs), object detection methods like YOLO and Faster R-CNN, image segmentation techniques, and architectures like U-Net and Mask R-CNN. Understanding these concepts is crucial for developing applications that can interpret and make decisions based on visual data.

CNN Architecture

Convolutional Neural Networks (CNNs)

CNNs are a class of deep neural networks most commonly applied to visual imagery. They are built from convolutional layers, pooling layers, and fully connected layers. A convolutional layer slides small learned filters over its input and passes the resulting feature maps to the next layer. A pooling layer reduces the spatial size of those feature maps, which lowers the computation and the number of parameters needed in later layers. The fully connected layers then perform classification on the features that the convolutional layers extracted and the pooling layers down-sampled.
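
To make the two core operations concrete, here is a minimal sketch in plain Python, assuming stride 1, no padding, and a single channel. The function names, the toy image, and the hand-picked edge filter are illustrative assumptions; a real CNN learns its filter values during training and uses an optimized library rather than nested loops.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding.

    Like deep learning frameworks, this computes a cross-correlation
    (the kernel is not flipped).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy 6x6 image: dark on the left, bright on the right.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]

# Hand-picked filter that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = conv2d_valid(image, kernel)  # 4x4: strongest response at the edge
pooled = max_pool2d(feature_map)           # 2x2: down-sampled feature map

print(feature_map)
print(pooled)
```

The 6x6 input shrinks to 4x4 after the 3x3 convolution and to 2x2 after pooling, which is the same shrinking pattern that appears layer by layer in a CNN.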

import tensorflow as tf
from tensorflow.keras import layers, models

# Define a simple CNN model
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Print the layer-by-layer summary (output shapes and parameter counts)
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              (None, 26, 26, 32)        320
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 64)        18496
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 3, 3, 64)          36928
_________________________________________________________________
flatten (Flatten)            (None, 576)               0
_________________________________________________________________
dense (Dense)                (None, 64)                36928
_________________________________________________________________
dense_1 (Dense)              (None, 10)                650
=================================================================
Total params: 93,322
Trainable params: 93,322
Non-trainable params: 0
_________________________________________________________________
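
The shapes and parameter counts of a model like this can be checked by hand. The sketch below assumes 'valid' padding with stride 1 for the convolutions (the Keras defaults) and non-overlapping 2x2 pooling; the helper names are illustrative and not part of Keras.

```python
def conv2d_layer(h, w, c_in, filters, k=3):
    """Return (output shape, parameter count) for a k x k 'valid' convolution."""
    out_shape = (h - k + 1, w - k + 1, filters)
    params = (k * k * c_in + 1) * filters  # weights plus one bias per filter
    return out_shape, params


def max_pool_layer(h, w, c, size=2):
    """Pooling halves height and width (rounding down) and has no parameters."""
    return (h // size, w // size, c), 0


def dense_layer(n_in, units):
    """Return (output shape, parameter count) for a fully connected layer."""
    return (units,), (n_in + 1) * units  # weights plus one bias per unit


shape = (28, 28, 1)  # input_shape from the model definition
counts = []

shape, p = conv2d_layer(*shape, filters=32); counts.append(p)  # (26, 26, 32)
shape, p = max_pool_layer(*shape);           counts.append(p)  # (13, 13, 32)
shape, p = conv2d_layer(*shape, filters=64); counts.append(p)  # (11, 11, 64)
shape, p = max_pool_layer(*shape);           counts.append(p)  # (5, 5, 64)
shape, p = conv2d_layer(*shape, filters=64); counts.append(p)  # (3, 3, 64)

flat = shape[0] * shape[1] * shape[2]  # Flatten: 3 * 3 * 64 = 576
_, p = dense_layer(flat, 64); counts.append(p)
_, p = dense_layer(64, 10);   counts.append(p)

total = sum(counts)
print(counts)  # [320, 0, 18496, 0, 36928, 36928, 650]
print(total)   # 93322
```

Most of the parameters sit in the last convolution and the first dense layer, which is typical for small CNNs of this shape.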

Object Detection

Object detection is the computer vision task of locating and classifying instances of semantic objects of a given class (such as people, buildings, or cars) in digital images and video. Well-researched applications include face detection, pedestrian detection, and vehicle detection. Two popular object detection frameworks are YOLO (You Only Look Once), a single-stage detector that predicts bounding boxes and class probabilities in one pass over the full image, and Faster R-CNN (Faster Region-based Convolutional Neural Network), a two-stage detector that first proposes candidate regions and then classifies and refines them.

import cv2
import numpy as np

# Load a pre-trained YOLO model
net = cv2.dnn.readNet('yolov3.weights', 'yolov3.cfg')

# Load class names
with open('coco.names', 'r') as f:
    classes = [line.strip() for line in f.readlines()]

# Load an image
image = cv2.imread('image.jpg')
height, width = image.shape[:2]

# Prepare the image for the model
blob = cv2.dnn.blobFromImage(image, 0.00392, (416, 416), (0, 0, 0), True, crop=False)
net.setInput(blob)

# Get the output layer names (works across OpenCV versions, unlike
# indexing the result of getUnconnectedOutLayers() by hand)
output_layers = net.getUnconnectedOutLayersNames()

# Run the forward pass
outs = net.forward(output_layers)

# Process the detections
class_ids = []
confidences = []
boxes = []
for out in outs:
    for detection in out:
        scores = detection[5:]
        class_id = np.argmax(scores)
        confidence = scores[class_id]
        if confidence > 0.5:
            center_x = int(detection[0] * width)
            center_y = int(detection[1] * height)
            w = int(detection[2] * width)
            h = int(detection[3] * height)
            x = int(center_x - w / 2)
            y = int(center_y - h / 2)
            boxes.append([x, y, w, h])
            confidences.append(float(confidence))
            class_ids.append(class_id)

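# Apply non-max suppression: drop boxes scoring below 0.5, then discard
# any box that overlaps a higher-scoring box with IoU above 0.4
indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)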

# Draw the bounding boxes
for i in range(len(boxes)):
    if i in indexes:
        x, y, w, h = boxes[i]
        label = str(classes[class_ids[i]])
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(image, label, (x, y + 30), cv2.FONT_HERSHEY_PLAIN, 3, (0, 0, 255), 2)

# Display the image
cv2.imshow('Image', image)
cv2.waitKey(0)
cv2.destroyAllWindows()
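
The non-max suppression step above is easier to follow with a small standalone sketch. The version below is a plain-Python illustration of the greedy idea, assuming boxes in the same [x, y, w, h] format as the listing; it is not OpenCV's implementation, and the function names and sample boxes are made up for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two [x, y, w, h] boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, score_threshold=0.5, iou_threshold=0.4):
    """Return indices of boxes kept after greedy non-max suppression."""
    # Consider only confident boxes, highest score first.
    order = sorted(
        (i for i, s in enumerate(scores) if s > score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap an already kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections: the first two boxes cover the same object.
boxes = [[100, 100, 50, 50], [105, 102, 50, 50], [300, 300, 40, 40]]
scores = [0.9, 0.8, 0.7]

kept = greedy_nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box is suppressed because it overlaps the higher-scoring first box with an IoU of about 0.76, while the third box does not overlap anything and is kept.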

💡 Tip: When working with object detection models like YOLO or Faster R-CNN, ensure that the input images are pre-processed correctly according to the model's requirements. This includes resizing, normalization, and potentially other transformations. Incorrect pre-processing can lead to poor detection performance.
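
As an illustration of what such pre-processing involves, the sketch below reproduces in NumPy roughly the steps requested from cv2.dnn.blobFromImage in the YOLO listing: resize to 416x416, scale pixel values by 1/255, swap BGR to RGB, and reorder to batch-first, channel-first layout. It is a simplified stand-in under stated assumptions: the resize is nearest-neighbour rather than OpenCV's interpolation, and the function name is hypothetical.

```python
import numpy as np


def preprocess_for_yolo(image_bgr, size=(416, 416)):
    """Turn an H x W x 3 uint8 BGR image into a 1 x 3 x H' x W' float32 blob."""
    h, w = image_bgr.shape[:2]
    out_w, out_h = size

    # Nearest-neighbour resize: pick one source row/column per output pixel.
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    resized = image_bgr[rows[:, None], cols]

    rgb = resized[:, :, ::-1]                # BGR -> RGB
    scaled = rgb.astype(np.float32) / 255.0  # pixel values into [0, 1]
    chw = np.transpose(scaled, (2, 0, 1))    # HWC -> CHW
    return chw[np.newaxis, ...]              # add the batch dimension


# Synthetic all-blue image (in BGR order) so the example needs no file.
image = np.zeros((480, 640, 3), dtype=np.uint8)
image[..., 0] = 255

blob = preprocess_for_yolo(image)
print(blob.shape, blob.dtype)
```

If any of these steps does not match what the model saw during training (for example, feeding BGR where RGB is expected, or values in 0 to 255 instead of 0 to 1), the network still runs but its detections degrade, which is why the tip matters.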

❓ What is the primary function of convolutional layers in a CNN?

❓ Which object detection framework uses a single neural network to predict bounding boxes and class probabilities directly from full images in one evaluation?
