
Document: Deep Learning for Computer Vision




DOCUMENT INFORMATION

Basic Information

Title: Deep Learning for Computer Vision
Author: Adrian Rosebrock
Institution: PyImageSearch
Field: Computer Vision
Type: Starter Bundle
Year: 2017
Pages: 332
Size: 26.41 MB

Structure

  • 1 Introduction

    • 1.1 I Studied Deep Learning the Wrong Way…This Is the Right Way

    • 1.2 Who This Book Is For

      • 1.2.1 Just Getting Started in Deep Learning?

      • 1.2.2 Already a Seasoned Deep Learning Practitioner?

    • 1.3 Book Organization

      • 1.3.1 Volume #1: Starter Bundle

      • 1.3.2 Volume #2: Practitioner Bundle

      • 1.3.3 Volume #3: ImageNet Bundle

      • 1.3.4 Need to Upgrade Your Bundle?

    • 1.4 Tools of the Trade: Python, Keras, and Mxnet

      • 1.4.1 What About TensorFlow?

      • 1.4.2 Do I Need to Know OpenCV?

    • 1.5 Developing Our Own Deep Learning Toolset

    • 1.6 Summary

  • 2 What Is Deep Learning?

    • 2.1 A Concise History of Neural Networks and Deep Learning

    • 2.2 Hierarchical Feature Learning

    • 2.3 How "Deep" Is Deep?

    • 2.4 Summary

  • 3 Image Fundamentals

    • 3.1 Pixels: The Building Blocks of Images

      • 3.1.1 Forming an Image From Channels

    • 3.2 The Image Coordinate System

      • 3.2.1 Images as NumPy Arrays

      • 3.2.2 RGB and BGR Ordering

    • 3.3 Scaling and Aspect Ratios

    • 3.4 Summary

  • 4 Image Classification Basics

    • 4.1 What Is Image Classification?

      • 4.1.1 A Note on Terminology

      • 4.1.2 The Semantic Gap

      • 4.1.3 Challenges

    • 4.2 Types of Learning

      • 4.2.1 Supervised Learning

      • 4.2.2 Unsupervised Learning

      • 4.2.3 Semi-supervised Learning

    • 4.3 The Deep Learning Classification Pipeline

      • 4.3.1 A Shift in Mindset

      • 4.3.2 Step #1: Gather Your Dataset

      • 4.3.3 Step #2: Split Your Dataset

      • 4.3.4 Step #3: Train Your Network

      • 4.3.5 Step #4: Evaluate

      • 4.3.6 Feature-based Learning versus Deep Learning for Image Classification

      • 4.3.7 What Happens When My Predictions Are Incorrect?

    • 4.4 Summary

  • 5 Datasets for Image Classification

    • 5.1 MNIST

    • 5.2 Animals: Dogs, Cats, and Pandas

    • 5.3 CIFAR-10

    • 5.4 SMILES

    • 5.5 Kaggle: Dogs vs. Cats

    • 5.6 Flowers-17

    • 5.7 CALTECH-101

    • 5.8 Tiny ImageNet 200

    • 5.9 Adience

    • 5.10 ImageNet

      • 5.10.1 What Is ImageNet?

      • 5.10.2 ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

    • 5.11 Kaggle: Facial Expression Recognition Challenge

    • 5.12 Indoor CVPR

    • 5.13 Stanford Cars

    • 5.14 Summary

  • 6 Configuring Your Development Environment

    • 6.1 Libraries and Packages

      • 6.1.1 Python

      • 6.1.2 Keras

      • 6.1.3 Mxnet

      • 6.1.4 OpenCV, scikit-image, scikit-learn, and more

    • 6.2 Configuring Your Development Environment?

    • 6.3 Preconfigured Virtual Machine

    • 6.4 Cloud-based Instances

    • 6.5 How to Structure Your Projects

    • 6.6 Summary

  • 7 Your First Image Classifier

    • 7.1 Working with Image Datasets

      • 7.1.1 Introducing the “Animals” Dataset

      • 7.1.2 The Start to Our Deep Learning Toolkit

      • 7.1.3 A Basic Image Preprocessor

      • 7.1.4 Building an Image Loader

    • 7.2 k-NN: A Simple Classifier

      • 7.2.1 A Worked k-NN Example

      • 7.2.2 k-NN Hyperparameters

      • 7.2.3 Implementing k-NN

      • 7.2.4 k-NN Results

      • 7.2.5 Pros and Cons of k-NN

    • 7.3 Summary

  • 8 Parameterized Learning

    • 8.1 An Introduction to Linear Classification

      • 8.1.1 Four Components of Parameterized Learning

      • 8.1.2 Linear Classification: From Images to Labels

      • 8.1.3 Advantages of Parameterized Learning and Linear Classification

      • 8.1.4 A Simple Linear Classifier With Python

    • 8.2 The Role of Loss Functions

      • 8.2.1 What Are Loss Functions?

      • 8.2.2 Multi-class SVM Loss

      • 8.2.3 Cross-entropy Loss and Softmax Classifiers

    • 8.3 Summary

  • 9 Optimization Methods and Regularization

    • 9.1 Gradient Descent

      • 9.1.1 The Loss Landscape and Optimization Surface

      • 9.1.2 The “Gradient” in Gradient Descent

      • 9.1.3 Treat It Like a Convex Problem (Even if It’s Not)

      • 9.1.4 The Bias Trick

      • 9.1.5 Pseudocode for Gradient Descent

      • 9.1.6 Implementing Basic Gradient Descent in Python

      • 9.1.7 Simple Gradient Descent Results

    • 9.2 Stochastic Gradient Descent (SGD)

      • 9.2.1 Mini-batch SGD

      • 9.2.2 Implementing Mini-batch SGD

      • 9.2.3 SGD Results

    • 9.3 Extensions to SGD

      • 9.3.1 Momentum

      • 9.3.2 Nesterov's Acceleration

      • 9.3.3 Anecdotal Recommendations

    • 9.4 Regularization

      • 9.4.1 What Is Regularization and Why Do We Need It?

      • 9.4.2 Updating Our Loss and Weight Update To Include Regularization

      • 9.4.3 Types of Regularization Techniques

      • 9.4.4 Regularization Applied to Image Classification

    • 9.5 Summary

  • 10 Neural Network Fundamentals

    • 10.1 Neural Network Basics

      • 10.1.1 Introduction to Neural Networks

      • 10.1.2 The Perceptron Algorithm

      • 10.1.3 Backpropagation and Multi-layer Networks

      • 10.1.4 Multi-layer Networks with Keras

      • 10.1.5 The Four Ingredients in a Neural Network Recipe

      • 10.1.6 Weight Initialization

      • 10.1.7 Constant Initialization

      • 10.1.8 Uniform and Normal Distributions

      • 10.1.9 LeCun Uniform and Normal

      • 10.1.10 Glorot/Xavier Uniform and Normal

      • 10.1.11 He et al./Kaiming/MSRA Uniform and Normal

      • 10.1.12 Differences in Initialization Implementation

    • 10.2 Summary

  • 11 Convolutional Neural Networks

    • 11.1 Understanding Convolutions

      • 11.1.1 Convolutions versus Cross-correlation

      • 11.1.2 The “Big Matrix” and “Tiny Matrix” Analogy

      • 11.1.3 Kernels

      • 11.1.4 A Hand Computation Example of Convolution

      • 11.1.5 Implementing Convolutions with Python

      • 11.1.6 The Role of Convolutions in Deep Learning

    • 11.2 CNN Building Blocks

      • 11.2.1 Layer Types

      • 11.2.2 Convolutional Layers

      • 11.2.3 Activation Layers

      • 11.2.4 Pooling Layers

      • 11.2.5 Fully-connected Layers

      • 11.2.6 Batch Normalization

      • 11.2.7 Dropout

    • 11.3 Common Architectures and Training Patterns

      • 11.3.1 Layer Patterns

      • 11.3.2 Rules of Thumb

    • 11.4 Are CNNs Invariant to Translation, Rotation, and Scaling?

    • 11.5 Summary

  • 12 Training Your First CNN

    • 12.1 Keras Configurations and Converting Images to Arrays

      • 12.1.1 Understanding the keras.json Configuration File

      • 12.1.2 The Image to Array Preprocessor

    • 12.2 ShallowNet

      • 12.2.1 Implementing ShallowNet

      • 12.2.2 ShallowNet on Animals

      • 12.2.3 ShallowNet on CIFAR-10

    • 12.3 Summary

  • 13 Saving and Loading Your Models

    • 13.1 Serializing a Model to Disk

    • 13.2 Loading a Pre-trained Model from Disk

    • 13.3 Summary

  • 14 LeNet: Recognizing Handwritten Digits

    • 14.1 The LeNet Architecture

    • 14.2 Implementing LeNet

    • 14.3 LeNet on MNIST

    • 14.4 Summary

  • 15 MiniVGGNet: Going Deeper with CNNs

    • 15.1 The VGG Family of Networks

      • 15.1.1 The (Mini) VGGNet Architecture

    • 15.2 Implementing MiniVGGNet

    • 15.3 MiniVGGNet on CIFAR-10

      • 15.3.1 With Batch Normalization

      • 15.3.2 Without Batch Normalization

    • 15.4 Summary

  • 16 Learning Rate Schedulers

    • 16.1 Dropping Our Learning Rate

      • 16.1.1 The Standard Decay Schedule in Keras

      • 16.1.2 Step-based Decay

      • 16.1.3 Implementing Custom Learning Rate Schedules in Keras

    • 16.2 Summary

  • 17 Spotting Underfitting and Overfitting

    • 17.1 What Are Underfitting and Overfitting?

      • 17.1.1 Effects of Learning Rates

      • 17.1.2 Pay Attention to Your Training Curves

      • 17.1.3 What if Validation Loss Is Lower than Training Loss?

    • 17.2 Monitoring the Training Process

      • 17.2.1 Creating a Training Monitor

      • 17.2.2 Babysitting Training

    • 17.3 Summary

  • 18 Checkpointing Models

    • 18.1 Checkpointing Neural Network Model Improvements

    • 18.2 Checkpointing Best Neural Network Only

    • 18.3 Summary

  • 19 Visualizing Network Architectures

    • 19.1 The Importance of Architecture Visualization

      • 19.1.1 Installing graphviz and pydot

      • 19.1.2 Visualizing Keras Networks

    • 19.2 Summary

  • 20 Out-of-the-box CNNs for Classification

    • 20.1 State-of-the-art CNNs in Keras

      • 20.1.1 VGG16 and VGG19

      • 20.1.2 ResNet

      • 20.1.3 Inception V3

      • 20.1.4 Xception

      • 20.1.5 Can We Go Smaller?

    • 20.2 Classifying Images with Pre-trained ImageNet CNNs

      • 20.2.1 Classification Results

    • 20.3 Summary

  • 21 Case Study: Breaking Captchas with a CNN

    • 21.1 Breaking Captchas with a CNN

      • 21.1.1 A Note on Responsible Disclosure

      • 21.1.2 The Captcha Breaker Directory Structure

      • 21.1.3 Automatically Downloading Example Images

      • 21.1.4 Annotating and Creating Our Dataset

      • 21.1.5 Preprocessing the Digits

      • 21.1.6 Training the Captcha Breaker

      • 21.1.7 Testing the Captcha Breaker

    • 21.2 Summary

  • 22 Case Study: Smile Detection

    • 22.1 The SMILES Dataset

    • 22.2 Training the Smile CNN

    • 22.3 Running the Smile CNN in Real-time

    • 22.4 Summary

  • 23 Your Next Steps

    • 23.1 So, What's Next?

Content

An excellent resource on approaching deep learning for beginners, covering everything from the basics through advanced topics, code included. The book comes in three parts: this is Part 1, the Starter Bundle; Part 2 is the Practitioner Bundle; Part 3 is the ImageNet Bundle.

I Studied Deep Learning the Wrong Way… This Is the Right Way

I want to start this book by sharing a personal story with you:

In the final stages of my graduate studies during 2013-2014, I began to explore the concept of "deep learning" due to a unique timing circumstance. My dissertation was nearly complete, with all my Ph.D. committee members having approved it. However, university and department regulations required me to remain enrolled for an additional semester.

Before I could officially defend my dissertation and graduate, I had a gap of approximately four months. This time proved to be an excellent opportunity to dive into studying deep learning.

As an academic, my initial step involved reviewing recent publications on deep learning. With my background in machine learning, I quickly understood the theoretical foundations of deep learning.

In my view, true learning occurs only when theoretical knowledge is put into practice. The transition from theory to implementation is a distinct process, as any computer scientist familiar with data structures will attest. For instance, understanding red-black trees in theory differs significantly from the practical experience of implementing them from the ground up, highlighting the necessity of diverse skill sets for each phase.

And that's exactly what my problem was.

After exploring various deep learning publications, I found myself confused and unable to implement the algorithms or reproduce the results. My frustration grew as I dedicated hours to searching for deep learning tutorials online, but I often came up empty-handed, as there were very few resources available at that time.

Finally, I resorted to playing around with libraries and tools such as Caffe, Theano, and Torch, blindly following poorly written blog posts (with mixed results, to say the least).

I wanted to get started, but nothing had actually clicked yet – the deep learning lightbulb in my head was stuck in the "off" position.

This past semester was emotionally challenging as I recognized the significant potential of deep learning in computer vision. Despite my efforts, I was left with little to show for it, merely a pile of deep learning papers that I comprehended but found difficult to implement.

After months of dedicated trial-and-error experiments, I achieved deep learning success, transforming my understanding of the field. This intensive four-month journey, filled with late nights and perseverance, significantly influenced my life and research direction.

But I would not advise you to take the same path I did.

If you take anything from my personal experience, it should be this:

1. You don't need a decade of theory to get started in deep learning.

2. You don't need pages and pages of equations.

3. You don't need a computer science degree to succeed in deep learning (although it can be beneficial).

When I began my journey in deep learning, I focused too much on theoretical publications without applying that knowledge practically. While understanding theory is crucial, the ability to implement it in real-world applications is essential for finding your niche in the deep learning field.

Deep learning and other advanced computer science disciplines emphasize the importance of practical experience alongside theoretical knowledge. This realization inspired me to write "Deep Learning for Computer Vision with Python," aiming to guide readers in becoming proficient practitioners in the field.

While there are:

1. Textbooks that will teach you the theoretical underpinnings of machine learning, neural networks, and deep learning

2. And countless "cookbook"-style resources that will "show you in code" but never relate the code back to true theoretical knowledge...

...none of these books or resources will serve as the bridge between the two.

On one side of the bridge, you find textbooks rich in theory and abstract concepts, while on the other side, there are practical coding books that focus on providing clear examples and hands-on learning experiences.

Who This Book Is For

Just Getting Started in Deep Learning?

Don't worry about overwhelming theory or complicated equations; we will begin with the fundamentals of machine learning and neural networks. You'll engage in a fun and practical learning experience filled with coding examples. Additionally, I will share key references to influential papers in the machine learning field, allowing you to deepen your understanding once you have established a solid foundation.

The key to success is to take the first step and get started. No matter your current skill level, you can trust that I will guide you through the learning process. By the end of the initial chapters of this book, you'll have a solid understanding of neural networks and be ready to tackle more advanced topics with confidence.

Already a Seasoned Deep Learning Practitioner?

This book caters to both beginners and advanced readers, offering a wealth of knowledge on deep learning for computer vision using Python. Each chapter includes academic references to enhance your understanding, while the explanations of complex concepts are presented in a clear and accessible manner.

The strategies and solutions presented in this book can be readily implemented in your current job and research, making it a valuable resource. By reading "Deep Learning for Computer Vision with Python," you'll save significant time, and the knowledge gained will prove beneficial for your projects and research endeavors.

Book Organization

Volume #1: Starter Bundle

The Starter Bundle is a great fit if you’re taking your first steps toward deep learning for image classification mastery.

You'll learn the basics of:

1. Machine learning

2. Neural networks

3. Convolutional Neural Networks

4. How to work with your own custom datasets

Volume #2: Practitioner Bundle

The Practitioner Bundle enhances the Starter Bundle, making it ideal for those seeking to delve deeply into deep learning, master advanced techniques, and learn essential best practices and guidelines.

Volume #3: ImageNet Bundle

The ImageNet Bundle offers a comprehensive deep learning experience for computer vision. This volume guides readers through training large-scale neural networks on the extensive ImageNet dataset and explores practical case studies such as age and gender prediction, vehicle make and model identification, and facial expression recognition, among others.

Need to Upgrade Your Bundle?

To upgrade your bundle quickly and easily, simply send me a message, and I will assist you with the upgrade process as soon as possible. Visit http://www.pyimagesearch.com/contact/ to get started.

Tools of the Trade: Python, Keras, and Mxnet

What About TensorFlow?

TensorFlow and Theano are versatile libraries designed for creating abstract, general-purpose computation graphs. Although primarily associated with deep learning, they serve a wide range of applications beyond this field, highlighting their broad utility in various computational tasks.

Keras, on the other hand, is a deep learning framework that provides a well-designed API to facilitate building deep neural networks with ease. Under the hood, Keras uses either the


TensorFlow or Theano computational backend, allowing it to take advantage of these powerful computation engines.

A computational backend serves as the engine of your application, allowing for the replacement and optimization of parts, or even the entire engine, as long as it meets specific requirements. By using Keras, you gain the flexibility to switch between different engines, enabling you to select the most suitable one for your project's needs.

Looking at this from a different angle: using TensorFlow or Theano to build a deep neural network would be akin to using only NumPy to build a machine learning classifier.

Using a dedicated machine learning library like scikit-learn is more advantageous than relying on NumPy, as it simplifies the coding process and reduces the amount of code needed significantly.

In the same vein, Keras sits on top of TensorFlow or Theano, enjoying:

1. The benefits of a powerful underlying computation engine

2. An API that makes it easier for you to build your own deep learning networks (see the short sketch below)
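
To make the second point concrete, here is a minimal sketch, assuming Keras with a TensorFlow or Theano backend; the layer sizes and class count are illustrative choices, not values from the book:

    from keras.models import Sequential
    from keras.layers import Dense

    # a tiny fully-connected network: 32x32x3 images flattened into 3072 inputs
    model = Sequential()
    model.add(Dense(256, activation="relu", input_shape=(3072,)))
    model.add(Dense(3, activation="softmax"))  # e.g., 3 output classes

    # identical code runs on either computational backend
    model.compile(loss="categorical_crossentropy", optimizer="sgd",
                  metrics=["accuracy"])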

Keras will soon be integrated into the core TensorFlow library at Google, allowing for seamless integration of TensorFlow code directly into Keras models. This development offers users the advantage of leveraging the strengths of both frameworks, enhancing the overall functionality and flexibility of their machine learning projects.

Do I Need to Know OpenCV?

You do not need to know the OpenCV computer vision and image processing library [7] to be successful when going through this book.

We only use OpenCV to facilitate basic image processing operations such as loading an image from disk, displaying it to our screen, and a few other basic operations.

Gaining experience with OpenCV is invaluable, especially for beginners in computer vision. I strongly suggest studying this book alongside my other publication, "Practical Python and OpenCV," for a comprehensive understanding.

Remember, deep learning is only one facet of computer vision – there are a number of computer vision techniques you should study to round out your knowledge.


Developing Our Own Deep Learning Toolset

This book aims to showcase how to leverage existing deep learning libraries to create a personalized Python toolset, empowering users to train their own deep learning networks effectively.

This book presents a unique deep learning toolkit that I have meticulously developed and refined through years of personal research and development in the field.

By the conclusion of "Deep Learning for Computer Vision with Python," we will have systematically developed the components of a comprehensive toolset that will enable us to:

1. Load image datasets from disk, store them in memory, or write them to an optimized database format.

2. Preprocess images such that they are suitable for training a Convolutional Neural Network.

3. Create a blueprint class that can be used to build our own custom implementations of Convolutional Neural Networks.

4. Implement popular CNN architectures by hand, such as AlexNet, VGGNet, GoogLeNet, ResNet, and SqueezeNet (and train them from scratch).

Summary

We are currently experiencing a remarkable era in the fields of machine learning, neural networks, and deep learning, marked by unprecedented advancements and exceptional tools at our disposal.

Advancements in software have introduced libraries like Keras and MXNet, which come with Python bindings, allowing for the rapid development of deep learning architectures. This innovation significantly reduces the time required to build complex models compared to previous years.

General-purpose GPUs are becoming more affordable and powerful, enabling individuals with modest budgets to build simple gaming rigs for meaningful deep learning research. The advancements in GPU technology, coupled with modular libraries and innovative researchers, are driving a surge in new publications that advance the state-of-the-art in deep learning on a monthly basis.

You see, now is the time to study deep learning for computer vision.

Seize this historic opportunity in deep learning, as early adopters are likely to reap significant rewards from their investments of time, resources, and creativity.

Enjoy this book. I'm excited to see where it takes you in this amazing field.

Deep learning methods utilize multiple layers of representation to transform raw input into more abstract forms through simple, nonlinear modules. A defining characteristic of deep learning is that these layers are not manually engineered but are instead learned from data using a general-purpose learning procedure, as highlighted by Yann LeCun, Yoshua Bengio, and Geoffrey Hinton in their 2015 Nature article.

Deep learning is a subfield of machine learning, which is, in turn, a subfield of artificial intelligence (AI). For a graphical depiction of this relationship, please refer to Figure 2.1.

The primary objective of artificial intelligence (AI) is to develop algorithms and techniques that can tackle complex problems that humans can solve intuitively and almost effortlessly. A prime example of this challenge is image interpretation and understanding, a task that humans manage with ease but remains highly difficult for machines to execute effectively.

Artificial Intelligence (AI) encompasses a wide range of tasks involving automatic machine reasoning, including inference, planning, and heuristics. In contrast, the machine learning subfield focuses primarily on pattern recognition and data-driven learning.

Artificial Neural Networks (ANNs) are machine learning algorithms designed for pattern recognition, drawing inspiration from the brain's structure and function. Deep learning, a subfield of ANNs, is a term often used interchangeably with them. Interestingly, the deep learning field has existed for over 60 years, evolving through various names and forms influenced by research trends, hardware advancements, and the preferences of leading researchers.

This chapter will provide an overview of the history of deep learning, explore the defining characteristics of deep neural networks, and introduce the concept of hierarchical learning, which has significantly contributed to the success of deep learning in contemporary machine learning and computer vision.

Figure 2.1: A Venn diagram describing deep learning as a subfield of machine learning, which is in turn a subfield of artificial intelligence (image inspired by Figure 1.4 of Goodfellow et al. [10]).

A Concise History of Neural Networks and Deep Learning

The history of neural networks and deep learning dates back to the 1940s, evolving through various terminologies such as cybernetics and connectionism, with the most recognized term being Artificial Neural Networks (ANNs).

Artificial Neural Networks (ANNs) are inspired by the human brain's neuron interactions but are not intended to serve as accurate brain models. Instead, they provide a foundational understanding of how to replicate certain brain-like behaviors in artificial systems. The first neural network model, developed by McCulloch and Pitts in 1943, functioned as a binary classifier that could identify two categories based on input data. However, this model required manual tuning of weights by humans for classifying inputs, which poses scalability challenges when human intervention is necessary.

In the 1950s, Rosenblatt introduced the groundbreaking Perceptron algorithm, enabling automatic learning of weights for input classification without human intervention. This innovative model laid the foundation for Stochastic Gradient Descent (SGD), a training method that remains essential for training deep neural networks today.

In the late 1960s, Perceptron-based techniques dominated the neural network landscape, but a pivotal 1969 study by Minsky and Papert halted progress in the field for nearly ten years. Their research revealed that a Perceptron using a linear activation function, regardless of its complexity, functioned solely as a linear classifier, rendering it incapable of addressing nonlinear problems. A prime example of such a challenge is the XOR dataset, illustrating the impossibility of using a single linear boundary to separate distinct classes.


The simple Perceptron network architecture, illustrated in Figure 2.2, processes multiple inputs by calculating a weighted sum and applying a step function to generate the final prediction. A comprehensive examination of the Perceptron will be provided in Chapter 10.

The authors contended that, at that time, we lacked the computational resources necessary to build large, deep neural networks, a prediction that has proven to be accurate. This pivotal paper nearly brought an end to neural network research.

Luckily, the backpropagation algorithm and the research of Werbos (1974) [15], Rumelhart et al. (1986), and LeCun (1998) revitalized the field. Their development of the backpropagation algorithm allowed for the effective training of multi-layer feedforward neural networks, marking a pivotal moment in artificial intelligence.

The introduction of nonlinear activation functions enabled researchers to address the XOR problem, paving the way for significant advancements in neural network research. Subsequent studies revealed that neural networks function as universal approximators, capable of approximating any continuous function, although there is no assurance that the network can effectively learn the necessary parameters to represent that function.

The backpropagation algorithm is essential for training modern neural networks, enabling them to learn from errors effectively. However, researchers faced challenges in training networks with more than two hidden layers due to the limitations of slow computers and the scarcity of large, labeled training datasets, making such tasks computationally impractical.

Deep learning represents the latest advancement in neural networks, distinguished by the use of faster, specialized hardware and an abundance of training data. This evolution allows for the training of networks with numerous hidden layers, enabling hierarchical learning where basic concepts are understood in the lower layers and more complex patterns are identified in the upper layers.

Figure: The XOR (Exclusive OR) dataset is an example of a nonlinearly separable problem that the Perceptron cannot solve – it is impossible to draw a single line that separates the blue stars from the red circles.

Perhaps the quintessential example of applied deep learning to feature learning is the Convolutional Neural Network, introduced by LeCun in 1988 for handwritten character recognition. This network automatically learns discriminating patterns, known as "filters," by stacking layers sequentially. Lower-level filters identify basic features like edges and corners, while higher-level layers use these features to learn more abstract concepts, aiding in the differentiation of image classes.

Convolutional Neural Networks (CNNs) have emerged as the leading image classifiers, driving advancements in various computer vision applications that utilize machine learning. For an in-depth exploration of the evolution of neural networks and deep learning, consider reviewing Goodfellow et al. [10] and Jason Brownlee's insightful blog post at Machine Learning Mastery [20].

Hierarchical Feature Learning

Machine learning algorithms can be categorized into three main types: supervised, unsupervised, and semi-supervised learning. This chapter will focus on explaining supervised and unsupervised learning, while semi-supervised learning will be addressed in a later discussion.

In supervised learning, a machine learning algorithm is provided with a set of input data and corresponding target outputs, allowing it to identify patterns that map inputs to their correct outputs. This process resembles a teacher overseeing a student's test, where the student utilizes prior knowledge to answer questions and receives guidance from the teacher to improve accuracy on future attempts.

In unsupervised machine learning, algorithms autonomously identify distinguishing features without prior guidance on the input data. In this context, a student aims to cluster similar questions and answers, despite lacking knowledge of the correct answers and without a teacher to provide validation.

A multi-layer, feedforward network architecture consists of an input layer with three nodes, two hidden layers (the first with two nodes and the second with three nodes), and an output layer containing two nodes. Unsupervised learning presents greater challenges compared to supervised learning, as it lacks target outputs, making it difficult to identify patterns that accurately map input data to the correct classifications.

In machine learning for image classification, algorithms analyze sets of images to identify patterns that differentiate between various classes or objects.

Historically, image analysis relied on hand-engineered features rather than the raw pixel intensities that are now prevalent in deep learning approaches. In our dataset, we utilized feature extraction to quantify each image through specific algorithms known as feature extractors or image descriptors. This process generates a vector, or numerical list, that effectively represents the contents of the image. For instance, Figure 2.5 illustrates how we quantified an image of prescription pill medication using various black-box descriptors focusing on color, texture, and shape.

Our hand-engineered features attempted to encode texture (Local Binary Patterns [21], Haralick texture [22]), shape (Hu Moments [23], Zernike Moments [24]), and color (color moments, color histograms, color correlograms [25]).
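
As a small illustration of one of these descriptors, here is a minimal sketch, assuming OpenCV is installed and an image pill.png exists on disk (the filename is illustrative): Hu Moments reduce the shape content of an image to a 7-dimensional feature vector.

    import cv2

    image = cv2.imread("pill.png")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # compute raw image moments, then the seven Hu Moments describing shape
    features = cv2.HuMoments(cv2.moments(gray)).flatten()
    print(features.shape)  # (7,)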

Various techniques, including keypoint detectors like FAST, Harris, and DoG, alongside local invariant descriptors such as SIFT, SURF, BRIEF, and ORB, are utilized to identify and describe the most salient regions of an image.

The Histogram of Oriented Gradients (HOG) method has demonstrated effective object detection in images, particularly when the viewpoint angle closely aligns with the training data of the classifier. An illustration of this technique can be seen in the application of the HOG combined with a Linear SVM detector.
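
To make this concrete, here is a minimal sketch of computing a HOG feature vector, assuming scikit-image is installed; the parameter values are common choices, not ones prescribed here:

    from skimage import color, data
    from skimage.feature import hog

    # astronaut() is a built-in sample image; any grayscale image works
    gray = color.rgb2gray(data.astronaut())

    # quantify the image as a histogram of gradient orientations
    features = hog(gray, orientations=9, pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2))
    print(features.shape)  # one long HOG feature vector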

Figure 2.5 illustrates the process of quantifying the contents of an image featuring prescription pill medication through various black-box descriptors, including color, texture, and shape. Additionally, Figure 2.6 demonstrates the detection of stop signs within images.

For a while, research in object detection in images was guided by HOG and its variants, including computationally expensive methods such as the Deformable Parts Model [34] and Exemplar SVMs [35].

Note: for a more in-depth study of image descriptors, feature extraction, and the role they play in computer vision, be sure to refer to the PyImageSearch Gurus course [33].

In various scenarios, we utilized hand-defined algorithms to quantify and encode specific aspects of images, such as shape, texture, and color. By applying these algorithms to input pixel data, we generated feature vectors that effectively represented the contents of the images. The original image pixels served solely as inputs for this feature extraction process, while the resulting feature vectors became the focal point of our interest, as they were essential inputs for our machine learning models.

Deep learning, particularly through Convolutional Neural Networks (CNNs), revolutionizes image analysis by automatically learning features during the training process, rather than relying on manually defined rules and algorithms.

Again, let’s return to the goal of machine learning: computers should be able to learn from experience (i.e., examples) of the problem they are trying to solve.

Deep learning enables us to comprehend problems through a hierarchy of concepts, where each level builds upon the previous one. Lower layers of the network capture fundamental representations, while higher layers synthesize these basics into more abstract ideas. This hierarchical approach eliminates the need for manual feature extraction, allowing Convolutional Neural Networks (CNNs) to function as end-to-end learners.

In a Convolutional Neural Network (CNN), pixel intensity values from an image serve as inputs, which are processed through a series of hierarchical hidden layers to extract features. Initially, the lower layers detect edge-like regions, which help identify corners where edges intersect and contours that outline objects. By combining these corners and contours, the network progresses to recognize abstract "object parts" in subsequent layers.

The filters in this system automatically learn to detect various concepts without any human intervention in the learning process. Ultimately, the output layer presents the results of this automated learning.

How "Deep" Is Deep? 27

Figure: the HOG + Linear SVM object detection framework being used to detect the locations of stop signs in images, as demonstrated in the PyImageSearch Gurus course.

The network classifies the image and generates output class labels; the output layer is influenced by all other nodes within the network.

Hierarchical learning in neural networks allows each layer to utilize the outputs from previous layers as foundational elements to develop more abstract concepts. This process occurs automatically, eliminating the need for manual feature engineering. A comparison between traditional image classification methods, which rely on hand-crafted features, and modern representation learning through deep learning and Convolutional Neural Networks highlights this advancement.

Deep learning and Convolutional Neural Networks (CNNs) offer the significant advantage of eliminating the need for manual feature extraction, enabling us to concentrate on training the network to automatically learn filters. However, achieving satisfactory accuracy when training a network on a specific image dataset can be challenging.

To quote Jeff Dean from his 2016 talk, Deep Learning for Building Intelligent Computer Systems [36]:

Deep learning is characterized by its use of large neural networks with multiple layers, which is reflected in the term "deep." This popular terminology has gained traction in media discussions surrounding advanced artificial intelligence technologies.

Deep learning can be understood as large neural networks with multiple layers that enhance complexity and depth. However, a key challenge remains: there is no definitive answer to how many layers are necessary for a neural network to be classified as deep.

The short answer is there is no consensus amongst experts on the depth a network needs to be considered deep [10].

The traditional image processing method involves manually extracting features from a set of input images and then training a machine learning classifier based on those features. In contrast, the deep learning approach utilizes a layered architecture that automatically learns complex, abstract, and discriminative features from the images, enhancing the overall classification performance.

When considering the type of network, it's essential to understand that a Convolutional Neural Network (CNN) is classified as a deep learning algorithm. However, if a CNN consists of only a single convolutional layer, it raises the question of whether such a shallow network can still be regarded as "deep" within the deep learning framework.

My personal opinion is that any network with greater than two hidden layers can be considered "deep". My reasoning is based on previous research in ANNs that was heavily handicapped by:

1. Our lack of large, labeled datasets available for training

2. Our computers being too slow to train large neural networks

During the 1980s and 1990s, training neural networks with more than two hidden layers proved challenging due to various issues. Geoff Hinton highlighted this in his 2016 talk on Deep Learning, explaining why earlier versions of artificial neural networks (ANNs) failed to gain traction during that decade.

1. Our labeled datasets were thousands of times too small.

2. Our computers were millions of times too slow.

3. We initialized the network weights in a stupid way.

4. We used the wrong type of nonlinear activation function.

All of these reasons point to the fact that training networks with a depth larger than two hidden layers was a futile, if not computational, impossibility.

In the current incarnation, we can see that the tides have changed. We now have:

1. Faster computers

2. Highly optimized hardware (i.e., GPUs)

3. Large, labeled datasets in the order of millions of images

4. A better understanding of weight initialization functions and what does/does not work

5. Superior activation functions and an understanding of why previous nonlinearity functions stagnated research

In his 2013 talk, Andrew Ng emphasized the advancements in deep learning, highlighting our capability to build deeper neural networks and train them effectively using larger datasets.

As the depth of a neural network increases, classification accuracy improves, in contrast with traditional machine learning algorithms like logistic regression, SVMs, and decision trees, which often reach a performance plateau despite additional training data. An example of this behavior, inspired by Andrew Ng's 2015 talk on deep learning [39], can be seen in Figure 2.8.

Figure 2.8: As the amount of data available to deep learning algorithms increases, accuracy does as well, substantially outperforming traditional feature extraction + machine learning approaches.

As the volume of training data grows, neural network algorithms achieve greater classification accuracy, while earlier methods tend to reach a performance plateau. This correlation between increased accuracy and larger datasets leads to the common association of deep learning with extensive data utilization.

When working on your own deep learning applications, I suggest using the following rule of thumb to determine if your given neural network is deep:

1. Are you using a specialized network architecture such as Convolutional Neural Networks, Recurrent Neural Networks, or Long Short-Term Memory (LSTM) networks? If so, yes, you are performing deep learning.

2. Does your network have a depth > 2? If yes, you are doing deep learning.

If your network has a depth greater than 10, you are engaging in very deep learning. It's important not to get distracted by the jargon associated with deep learning; at its core, deep learning has evolved over the past 60 years through various theories, all centered on artificial neural networks that mimic the brain's structure and function. Regardless of the depth, width, or specific architecture of your network, you are still utilizing machine learning with artificial neural networks.

Summary

This chapter addressed the complicated question of "What is deep learning?".

Deep learning, a subset of Artificial Neural Networks (ANNs), has its roots dating back to the 1940s, evolving through various names and interpretations influenced by different research trends and schools of thought. At its essence, deep learning employs algorithms that mimic the brain's structure and functionality to identify and learn patterns.

There is no consensus amongst experts on exactly what makes a neural network “deep”; however, we know that:

1. Deep learning algorithms learn in a hierarchical fashion and therefore stack multiple layers on top of each other to learn increasingly more abstract concepts.

2. A network should have > 2 layers to be considered "deep" (this is my anecdotal opinion based on decades of neural network research).

3. A network with > 10 layers is considered very deep (although this number will change as architectures such as ResNet have been successfully trained with over 100 layers).

If you're feeling confused or overwhelmed after this chapter, that's normal; the goal was to give you a broad overview of deep learning and clarify the meaning of "deep."

This chapter introduces essential concepts such as pixels, edges, and corners, which will be further explored in the next section to establish a solid foundation. Following this, we will delve into the fundamentals of neural networks, paving the way for advanced topics like deep learning and Convolutional Neural Networks. While this chapter provides a high-level overview, the subsequent chapters will offer hands-on experience, enabling you to master deep learning concepts in computer vision.

To build effective image classifiers, it's essential to first grasp the fundamental concept of an image, beginning with its basic unit: the pixel.

This chapter explores the concept of pixels, detailing their role in image formation and how to access them as NumPy arrays, a common practice in Python image processing libraries such as OpenCV and scikit-image.

The chapter will conclude with a discussion on the aspect ratio of an image and the relation it has when preparing our image dataset for training a neural network.

Pixels: The Building Blocks of Images

Forming an Image From Channels

An RGB image is composed of three values per pixel, corresponding to its Red, Green, and Blue components. This can be visualized as three independent matrices, each representing one of the RGB components, with a width of W and a height of H. By combining these three matrices, we create a multi-dimensional array of shape W×H×D, where D is the depth, or number of channels (for the RGB color space, D = 3):

Figure 3.5: Representing an image in the RGB color space, where each channel is an independent matrix that, when combined with the others, forms the final image.
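
A minimal NumPy sketch of this idea (the dimensions are illustrative): three independent channel matrices, stacked along the depth axis, form a single RGB image.

    import numpy as np

    H, W = 248, 300  # illustrative height and width

    # three independent channel matrices, each H x W
    red = np.zeros((H, W), dtype="uint8")
    green = np.zeros((H, W), dtype="uint8")
    blue = np.full((H, W), 255, dtype="uint8")

    # combining the channels yields one multi-dimensional array with D = 3
    image = np.dstack([red, green, blue])
    print(image.shape)  # (248, 300, 3)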

It's important to note that the depth of an image is entirely different from the depth of a neural network, a distinction that will become clearer once we begin training our own Convolutional Neural Networks. For now, just recognize that these are two separate notions of depth.

In the RGB color space, colors are represented through three channels, each ranging from 0 to 255. Each pixel in an RGB image consists of a list of three integers, corresponding to the values of Red, Green, and Blue.

• Programmatically, an image is defined as a 3D NumPy multidimensional array with a width, height, and depth.

An image can be visualized as a grid of pixels, similar to a piece of graph paper. In this representation, the origin point (0, 0) is located at the upper-left corner of the image, with the x values increasing as you move right and the y values increasing as you move down.

Figure 3.6 illustrates a graph paper representation featuring the letter "I" on an 8×8 grid, consisting of 64 pixels. It's crucial to remember that in Python, counting begins at zero rather than one, which is a key distinction to avoid confusion, particularly for those transitioning from a MATLAB environment.

In Figure 3.6, the letter "I" is displayed on graph paper, illustrating how pixels are accessed using their (x, y) coordinates. To locate a pixel, one moves x columns to the right and y rows down, noting that Python utilizes zero-based indexing.

Images as NumPy Arrays

Figure 3.7: Loading an image named example.png from disk and displaying it to our screen with OpenCV.

Image processing libraries like OpenCV and scikit-image utilize multi-dimensional NumPy arrays to represent RGB images, structured as (height, width, depth). This format can confuse beginners, as it presents the height before the width, contrasting with the common perception of images being defined by width first and then height.

The answer is due to matrix notation.

When defining the dimensions of a matrix, we express it as rows × columns. In the context of an image, the number of rows represents its height, while the number of columns indicates its width. The depth of the image remains unchanged.

The representation of a NumPy array's shape as (height, width, depth) may initially seem confusing, but it intuitively reflects the construction and annotation of a matrix.

For example, let's take a look at the OpenCV library and the cv2.imread function used to load an image from disk and display its dimensions:

import cv2

image = cv2.imread("example.png")
print(image.shape)          # the dimensions as (height, width, depth)
cv2.imshow("Image", image)  # display the image on our screen
cv2.waitKey(0)              # wait for a keypress before exiting

Here we load an image named example.png from disk and display it to our screen, as the screenshot in Figure 3.7 demonstrates. My terminal output, (248, 300, 3), indicates the image is 300 pixels wide, 248 pixels tall, with a depth of 3 channels. To retrieve a specific pixel value from the image, we can use straightforward NumPy array indexing:

(b, g, r) = image[75, 25]  # accesses the pixel at x=25, y=75

In this syntax, the y-value is passed before the x-value, which may seem unfamiliar at first; however, it aligns with the standard method of accessing values in a matrix, where we first indicate the row number followed by the column number. This ultimately provides us with a tuple representing the Blue, Green, and Red components of the pixel (more on this ordering in the next section).

RGB and BGR Ordering

OpenCV stores RGB channels in reverse order, using Blue, Green, and Red instead of the conventional Red, Green, and Blue. This distinction is crucial for accurate image processing and manipulation in OpenCV.

OpenCV utilizes the BGR color format primarily due to historical reasons, as early developers opted for this arrangement because it was widely adopted by camera manufacturers and software developers during that period.

The BGR ordering in OpenCV is a historical decision that we must accept, serving as a crucial consideration when utilizing the library.
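
In practice, that consideration usually amounts to a single conversion call. A minimal sketch, assuming OpenCV and the example.png image from earlier:

    import cv2

    image = cv2.imread("example.png")  # pixels are stored in BGR order

    # reverse the channels before handing the image to RGB-expecting libraries
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    (b, g, r) = image[0, 0]  # OpenCV indexing yields (Blue, Green, Red)
    print(b, g, r)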

Scaling and Aspect Ratios

Scaling, also known as resizing, involves adjusting the dimensions of an image by altering its width and height. It's crucial to maintain the aspect ratio during this process to ensure the image retains its original proportions and visual integrity.

Resizing an image without maintaining its aspect ratio can result in distorted visuals, as demonstrated in Figure 3.8. The aspect ratio, defined as the ratio of an image's width to its height, is crucial for preserving the original appearance. Neglecting this ratio often leads to images that appear compressed and unappealing.

The original image is displayed on the left, while the top and bottom images show distortions caused by not maintaining the aspect ratio. This leads to images appearing crunched and squished. To avoid such distortions, it's essential to resize images by scaling their width and height proportionally.

When resizing images for deep learning, it's essential to maintain a fixed size input for neural networks, particularly Convolutional Neural Networks (CNNs), which require uniform dimensions for all images processed. Commonly used dimensions for input images include 32×32, 64×64, 224×224, 227×227, 256×256, and 299×299. While aesthetic considerations often prioritize aspect ratio, the constraints of deep learning models necessitate a focus on consistent image sizes.

When designing a network to classify 224×224 images, we face the challenge of preprocessing a dataset containing various image sizes, such as 312×234, 800×600, and 770×300. The key question is whether to ignore the aspect ratio and accept potential distortion or to implement a more effective resizing strategy. One viable approach is to resize the image based on its shortest dimension and then apply a center crop to achieve the desired dimensions without compromising the image quality.
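
Here is a minimal sketch of that resize-then-center-crop strategy, assuming OpenCV; the helper name and the 224-pixel target are illustrative:

    import cv2

    def resize_and_center_crop(image, size=224):
        (h, w) = image.shape[:2]
        # scale so that the *shorter* side equals the target size
        scale = size / float(min(h, w))
        resized = cv2.resize(image, (int(round(w * scale)),
                                     int(round(h * scale))))

        # crop the central size x size region
        (h, w) = resized.shape[:2]
        startx = (w - size) // 2
        starty = (h - size) // 2
        return resized[starty:starty + size, startx:startx + size]

    image = cv2.imread("example.png")
    crop = resize_and_center_crop(image)
    print(crop.shape)  # (224, 224, 3)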

Maintaining the correct aspect ratio of an image is crucial for accurate representation; failing to do so can lead to distortion, as seen in the bottom left example. Conversely, preserving the aspect ratio, as shown in the bottom right, may require cropping, which risks omitting essential parts of the image. This cropping could severely impact the effectiveness of our image classification system by potentially excluding key objects that need to be identified.

The best method for image preprocessing depends on the dataset being used. In some cases, it may be acceptable to distort and compress images without considering the aspect ratio. However, for other datasets, it is beneficial to resize images along the shortest dimension and crop the center for optimal results.

In this book, we will explore both methods in detail, emphasizing their implementation, while also highlighting the significance of understanding image fundamentals at this stage.

In Figure 3.9, the top image displays the original input, while the bottom left illustrates the process of resizing an image to 224×224 pixels without maintaining the aspect ratio. Conversely, the bottom right demonstrates resizing to 224×224 pixels by first adjusting the shortest dimension and then performing a center crop.

Summary

This chapter focused on the essential elements of an image, specifically the pixel. It highlighted that grayscale images are defined by a single scalar representing pixel intensity or brightness. Additionally, the RGB color space, the most widely used color format, represents each pixel as a 3-tuple corresponding to the Red, Green, and Blue components.

Python's computer vision and image processing libraries utilize the NumPy numerical processing library, representing images as multi-dimensional NumPy arrays with a shape defined by height, width, and depth. The height, indicating the number of rows, is specified first, followed by the width, which represents the number of columns. The depth corresponds to the number of channels in the image, with the RGB color space having a fixed depth of 3.

In conclusion, we explored the significance of image aspect ratio in resizing inputs for neural networks and Convolutional Neural Networks For an in-depth understanding of color spaces, image coordinate systems, resizing techniques, and foundational concepts of the OpenCV library, please consult "Practical Python and OpenCV" and the PyImageSearch Gurus course.

“A picture is worth a thousand words”– English idiom

The saying "a picture is worth a thousand words" highlights the power of visual content in conveying complex ideas From analyzing stock portfolio line charts to assessing the odds of a football game or appreciating the intricacies of a masterful painting, we continuously interpret and absorb visual information, which we later utilize to enhance our understanding and decision-making.

Interpreting images is a complex task for computers, as they perceive them merely as large matrices of numbers Unlike humans, computers lack the ability to understand the thoughts, knowledge, or meanings that images convey.

Image classification is a crucial aspect of computer vision and machine learning that involves extracting meaning from images This process ranges from simply labeling the contents of an image to interpreting it and generating human-readable descriptions As a rapidly evolving field, image classification incorporates various techniques, particularly with the rise of deep learning, leading to significant advancements and growth.

Now is the time to ride the deep learning and image classification wave – those who successfully do so will be handsomely rewarded.

Image classification and understanding are set to dominate the computer vision landscape for the next decade Major tech companies like Google, Microsoft, and Baidu are expected to rapidly acquire successful startups in this field As a result, consumers will increasingly benefit from smartphone applications capable of interpreting image content Furthermore, the future of warfare may involve unmanned aircraft guided by advanced computer vision algorithms.

This chapter offers a comprehensive overview of image classification, highlighting the various challenges faced by image classification algorithms Additionally, it examines the three primary types of learning related to image classification and machine learning.

In conclusion, this chapter outlines the four essential steps for training a deep learning network for image classification and highlights the differences between this four-step process and the conventional hand-engineered feature extraction pipeline.

What Is Image Classification?

A Note on Terminology

In machine learning and deep learning, a dataset serves as a collection of data points from which knowledge is extracted. Each data point can represent various forms of data, including images, text, or audio.

Our objective is to utilize machine learning and deep learning algorithms to uncover hidden patterns within the dataset, allowing for accurate classification of previously unseen data points. It's essential to understand this terminology now:

1. In the context of image classification, our dataset is a collection of images.

I'll be using the terms image and data point interchangeably throughout the rest of this book, so keep this in mind now.

Figure 4.2: A dataset (outer rectangle) is a collection of data points (circles).

The Semantic Gap

In Figure 4.3, the two photos at the top showcase a clear distinction between a cat on the left and a dog on the right. However, for a computer, these images are merely large matrices of pixels, lacking inherent understanding of the objects they represent.

The semantic gap refers to the disparity between human perception of image content and the representation of that content in a format that computers can comprehend Since computers interpret images solely as a matrix of pixels, bridging this gap is essential for enhancing machine understanding and analysis of visual data.

A quick visual comparison of the two photos highlights the differences between the two animal species; however, the computer has no notion that there are animals in the images at all. To illustrate this concept further, refer to Figure 4.4, which features a serene beach photo.

We might describe the image as follows:

• Spatial: The sky is at the top of the image and the sand/ocean are at the bottom.

• Color: The sky is dark blue, the ocean water is a lighter blue than the sky, while the sand is tan.

• Texture: The sky has a relatively uniform pattern, while the sand is very coarse.

To enable computers to understand images, we apply feature extraction, a method that quantifies image content. This process involves applying an algorithm to an input image to produce a feature vector, a numerical representation that captures the essential characteristics of the image.

To quantify an image, we can use traditional hand-engineered features such as HOG and LBPs. However, this book advocates for a more advanced approach, employing deep learning techniques that automatically learn the features used to label an image and analyze its content.
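As a minimal sketch of the hand-engineered route (assuming scikit-image is installed; the file name cat.png and the parameter choices are illustrative, not prescriptive), a HOG descriptor turns an image into a single fixed-length feature vector:

```python
from skimage import color, io
from skimage.feature import hog

# load an image and convert it to grayscale before describing it
image = color.rgb2gray(io.imread("cat.png"))

# quantify the image's structure as one HOG feature vector
features = hog(image, orientations=9, pixels_per_cell=(8, 8),
    cells_per_block=(2, 2))
print(features.shape)  # a single numerical representation of the image
```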

However, it’s not that simple, because once we start examining images in the real world we are faced with many, many challenges.

Our brains can easily distinguish between images of a cat and a dog, while computers see only large matrices of numbers. This gap between perception and representation is known as the semantic gap.

Challenges

In addition to the challenges posed by the semantic gap, we must also address various factors of variation that affect the appearance of images or objects.

Viewpoint variation refers to an object, such as a Raspberry Pi, being captured from multiple angles and orientations. Regardless of the perspective from which it is photographed, the object remains the same Raspberry Pi and must be identifiable as such.

When considering image classification, it's essential to account for scale variation, much like the different sizes of coffee cups at Starbucks (tall, grande, and venti) despite being the same beverage. This scale difference becomes even more pronounced in photographs, where a venti coffee can appear vastly different depending on the distance from which it's captured. Additionally, one of the most challenging variations to manage is deformation. A prime example is Gumby, the elastic character from the TV series, who can contort into various poses. While all images feature Gumby, the significant differences in his appearance highlight the complexities of object deformation in image classification.

Our image classification system should also be able to handle occlusions, where large parts of the object we want to classify are hidden from view. For instance, Figure 4.5 illustrates a dog in two scenarios: one image shows the dog clearly, while the other depicts the dog partially obscured under the covers. Despite the occlusion, an image classification algorithm should still identify and label the dog's presence in both images.

Handling changes in illumination poses challenges similar to those of deformations and occlusions. For instance, consider the same coffee cup depicted under both standard and low lighting conditions. The image taken under standard overhead lighting shows the cup quite differently from the one captured in minimal light; notably, the vertical cardboard seam of the cup becomes distinctly visible in low lighting, highlighting how lighting conditions can dramatically alter the appearance of the same object.

When considering background clutter, think of the game "Where's Waldo?" This game illustrates the challenge of identifying a specific object amidst a chaotic scene filled with distractions. The images are dense with detail, making it difficult to spot Waldo, our red-and-white-striped friend. This visual "noise" highlights the difficulty faced by computers, which lack the semantic understanding necessary to pick out a specific element in a cluttered image.

Intra-class variation in computer vision refers to the differences within a single category, exemplified by the diverse types of chairs. From cozy reading chairs to functional dining chairs and stylish art deco designs, all of these variations still fall under the umbrella of "chair." For effective image classification, algorithms must accurately identify and categorize these distinct forms of a single class.

Building an image classifier can feel overwhelming due to this complexity. The challenge intensifies because the system must not only be resilient to each individual variation but also handle multiple variations occurring simultaneously.

To account for the vast array of variations in objects and images, we must carefully define the problem at hand. This involves making informed assumptions about the content of our images and identifying the specific variations we are willing to tolerate. Additionally, we should consider the overall scope of the project before committing to an approach.

When creating an image classification system, it's crucial to consider the factors that can affect object appearance, such as different viewpoints, lighting conditions, occlusions, and scale. Understanding these elements is essential to defining the project's end goal and clarifying what we aim to build.

Successful computer vision, image classification, and deep learning systems deployed to the real world make careful assumptions and considerations before a single line of code is ever written.

Taking too broad an approach, like attempting to classify and detect every object in your kitchen, will hinder the performance of your classification system. With potentially hundreds of objects to identify, success is unlikely without extensive experience in building image classifiers, and even then there are no guarantees of achieving your project goals.

To improve accuracy in image classification and deep learning, it's essential to narrow the focus of your problem. For instance, specifying "I want to recognize just stoves and refrigerators" makes your system far more tractable, particularly if you are a beginner in this field.

When developing an image classifier, it's crucial to define the project's scope clearly. Although deep learning and Convolutional Neural Networks (CNNs) show impressive robustness and classification power across diverse challenges, keeping the scope tight and well-defined will make your project far more effective.

Types of Learning

Supervised Learning

Suppose you've just graduated with a Bachelor of Science in Computer Science, feeling overwhelmed and financially strained as you embark on your job search in the tech industry.

But before you know it, a Google recruiter finds you on LinkedIn and offers you a position working on their Gmail software. Are you going to take it? Most likely.

A few weeks later, you arrive at Google's impressive campus in Mountain View, California, captivated by the stunning scenery, the array of Teslas parked nearby, and the seemingly endless selection of gourmet food available in the cafeteria.

After settling into your desk in a bustling open workspace filled with numerous colleagues, you learn about your new position. Your responsibility is to develop software that can automatically categorize emails as either spam or not spam.

To identify spam emails, one might consider a rule-based approach, using if/else statements to search for specific keywords. While this method could be somewhat effective, it is ultimately limited and prone to failure, making it difficult to sustain over time.

To effectively filter emails as spam or not, machine learning is the better tool. Given a training set of emails paired with their corresponding labels, such as spam or not-spam, you can analyze the text and word distributions within the emails. From this data, a machine learning classifier can learn which words are indicative of spam, eliminating the need for a tangle of manually coded if/else statements.
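As a toy sketch of that idea (assuming scikit-learn is installed; the emails and labels below are made up for illustration), a bag-of-words model can learn spam-indicative words directly from examples:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# a tiny, made-up training set of emails and their labels
emails = ["win a free prize now", "meeting agenda for monday",
    "free money click here", "lunch with the team tomorrow"]
labels = ["spam", "not-spam", "spam", "not-spam"]

# turn each email into a vector of word counts, then fit a classifier
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
model = MultinomialNB().fit(X, labels)

# the trained model can now label unseen emails
print(model.predict(vectorizer.transform(["free prize meeting"])))
```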

Creating a spam filter system exemplifies supervised learning, the most recognized and extensively studied type of machine learning. In this approach, a model (often referred to as a "classifier") is built from training data. The model makes predictions on the input data and is corrected whenever its predictions are wrong. This training process continues until the model meets a predefined stopping criterion, such as achieving a low error rate or reaching a maximum number of training iterations.

Common supervised learning algorithms include Logistic Regression, Support Vector Machines (SVMs) [43, 44], Random Forests [45], and Artificial Neural Networks.

In the context of image classification, we assume our image dataset consists of the images themselves along with their corresponding class labels that we can use to teach our machine learning classifier what each category looks like (Table 4.1).

Label | Rµ     | Gµ     | Bµ     | Rσ     | Gσ     | Bσ
Cat   | 57.61  | 41.36  | 123.44 | 158.33 | 149.86 | 93.33
Cat   | 120.23 | 121.59 | 181.43 | 145.58 | 69.13  | 116.91
Cat   | 124.15 | 193.35 | 65.77  | 23.63  | 193.74 | 162.70
Dog   | 100.28 | 163.82 | 104.81 | 19.62  | 117.07 | 21.11
Dog   | 177.43 | 22.31  | 149.49 | 197.41 | 18.99  | 187.78
Dog   | 149.73 | 87.17  | 187.97 | 50.27  | 87.15  | 36.65

Table 4.1 presents a dataset featuring class labels of either "dog" or "cat" alongside their corresponding feature vectors, which consist of the mean and standard deviation of each color channel (Red, Green, and Blue). This dataset exemplifies a supervised classification task: the classifier can learn the characteristics of each category from the labeled examples, and when it mispredicts, we can apply corrective methods to address its errors.

Examining the example in Table 4.1 makes it easier to understand the distinctions between supervised, unsupervised, and semi-supervised learning. In this table, the first column is the label associated with a given image, while the remaining six columns form the feature vector for each data point, computed as the mean and standard deviation of each RGB color channel.

Our supervised learning algorithm makes predictions on these feature vectors, and when it predicts incorrectly, we supply the correct label to help it improve. This iterative process continues until a predetermined stopping criterion is reached, such as a target accuracy, a maximum number of iterations, or a time limit.
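A minimal sketch of this supervised pipeline (assuming OpenCV, NumPy, and scikit-learn are installed; the image paths are hypothetical) might pair Table 4.1-style feature vectors with their labels like this:

```python
import cv2
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def describe(image):
    # quantify an image by the mean and standard deviation of each
    # of its three color channels -- a 6-dimensional feature vector
    (means, stds) = cv2.meanStdDev(image)
    return np.concatenate([means, stds]).flatten()

# hypothetical image paths and their ground-truth class labels
paths = ["cat_01.png", "cat_02.png", "dog_01.png", "dog_02.png"]
labels = ["cat", "cat", "dog", "dog"]
X = [describe(cv2.imread(p)) for p in paths]

# fit a supervised classifier on the (feature vector, label) pairs
model = KNeighborsClassifier(n_neighbors=1).fit(X, labels)
print(model.predict([describe(cv2.imread("mystery.png"))]))
```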

Remark: To explain the differences between supervised, unsupervised, and semi-supervised learning, I opted for a feature-based approach, using the mean and standard deviation of the RGB color channels to quantify image content. When we start working with Convolutional Neural Networks, we will actually skip the feature extraction step and use the raw pixel intensities instead. Since images can be large M × N matrices, which do not fit nicely into this spreadsheet layout, the feature-extraction framing makes it easier to visualize the differences between the types of learning.

Unsupervised Learning

Unsupervised learning, sometimes referred to as self-taught learning, differs from supervised learning in that it operates without labeled input data. As a result, there is no way to correct the model when it makes an incorrect prediction.

Going back to the spreadsheet example, converting a supervised learning problem to an unsupervised learning one is as simple as removing the “label” column (Table 4.2).

Unsupervised learning is often regarded as the "holy grail" of machine learning and image classification due to the vast amount of unlabeled data available online, such as the millions of images on Flickr and videos on YouTube. If algorithms could learn patterns from this unlabeled data, we could dramatically reduce the time and cost of manually labeling images for supervised learning tasks.

Unsupervised learning algorithms excel at uncovering the underlying structure of a dataset. This structure can then be leveraged to improve supervised learning tasks, particularly when labeled data is scarce.

Classic machine learning algorithms for unsupervised learning include Principal Component Analysis (PCA) and k-means clustering. Specific to neural networks, Autoencoders, Self-Organizing Maps (SOMs), and Adaptive Resonance Theory are applied to unsupervised learning. Unsupervised learning remains a vibrant, active area of research, but it is not the focus of this book.

Table 4.2: Unsupervised learning algorithms attempt to uncover hidden patterns in a dataset without class labels; removing the class label column transforms the task into an unsupervised learning problem.
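To make this concrete, here is a minimal sketch (assuming scikit-learn is installed; the feature matrix below is a stand-in for Table 4.2's label-free rows) in which k-means groups feature vectors without ever seeing a class label:

```python
import numpy as np
from sklearn.cluster import KMeans

# feature vectors only -- the "label" column has been removed
X = np.array([[57.61, 41.36, 123.44], [120.23, 121.59, 181.43],
    [100.28, 163.82, 104.81], [177.43, 22.31, 149.49]])

# ask k-means to discover two groups purely from the data's structure
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(clusters)  # cluster indices, not semantic labels like "cat"/"dog"
```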

In semi-supervised learning, we have labels for only a portion of the images or feature vectors and must infer labels for the remaining data points; the unlabeled examples can then serve as additional training material.


Semi-supervised Learning

When we have only partial labels for our data, we can still classify the remaining data points using semi-supervised learning. This approach combines elements of both supervised and unsupervised learning, leveraging the available labeled data while also exploiting the unlabeled data to improve classification accuracy.

In our spreadsheet example, suppose we have labels for only a small portion of our input data. The semi-supervised algorithm examines the known labels, attempts to label the unlabeled data points, and uses those newly labeled points as additional training data. This iterative process lets the algorithm learn the structure of the data, leading to better predictions and more reliable training data.

Semi-supervised learning is especially valuable in computer vision, where labeling images is time-consuming and costly. When resources are limited, this technique lets us label only a small subset of our training images and still classify the remaining data, making it an effective solution for large datasets.

Semi-supervised learning algorithms typically trade some classification accuracy for the ability to work with smaller labeled datasets. Generally, the accuracy of a supervised learning algorithm, particularly a deep learning model, improves as more accurately labeled training data becomes available.

As labeled training data diminishes, accuracy tends to decline; semi-supervised learning addresses this by balancing classification accuracy against the amount of labeled data required. It enables the development of accurate classifiers (though typically not as accurate as fully supervised ones) with far less labeling effort. Common methods in semi-supervised learning include label spreading, label propagation, ladder networks, and co-learning/co-training.
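As a minimal sketch of this trade-off (assuming scikit-learn is installed; the feature vectors and labels below are made up), scikit-learn's semi-supervised estimators mark unlabeled points with -1 and infer their labels from the labeled ones:

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# six feature vectors; only two have known labels (0 = cat, 1 = dog),
# the remaining four are unlabeled and marked with -1
X = np.array([[57.6, 41.3], [120.2, 121.5], [124.1, 193.3],
    [100.2, 163.8], [177.4, 22.3], [149.7, 87.1]])
y = np.array([0, -1, -1, 1, -1, -1])

# propagate the two known labels through the dataset's structure
model = LabelSpreading(kernel="knn", n_neighbors=3).fit(X, y)
print(model.transduction_)  # inferred labels for every data point
```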

This book will mainly concentrate on supervised learning, as unsupervised and semi-supervised learning in deep learning for computer vision remain active research areas lacking definitive guidelines on effective methods.
