Can a machine identify a bee as a honey bee or a bumble bee?
Required Libraries
from pathlib import Path
from PIL import Image
from IPython.display import display
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
test_data = np.random.beta(1, 1, size=(100, 100, 3))
plt.imshow(test_data)
Pillow is a very flexible image loading and manipulation library.
# open the image
img = Image.open('datasets/bee_1.jpg')
# Get the image size
img_size = img.size
print("The image size is: {}".format(img_size))
#display image
img
>>> The image size is: (100, 100)
Pillow has a number of common image manipulation tasks built into the library.
- resizing
- cropping
- rotating
- flipping
- converting to greyscale
img_cropped = img.crop((25, 25, 75, 75))
display(img_cropped)
img_rotated = img.rotate(45, expand=True)
display(img_rotated)
img_flipped = img.transpose(Image.FLIP_LEFT_RIGHT)
display(img_flipped)
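The list above also mentions resizing and converting to greyscale, which the cells above do not show. Here is a minimal sketch of both; it uses a synthetic random image (an assumption, so that it runs without the `datasets/` folder) in place of `bee_1.jpg`.

```python
import numpy as np
from PIL import Image

# Synthetic 100x100 RGB image standing in for the bee photo (assumption).
rng = np.random.default_rng(0)
demo_img = Image.fromarray(rng.integers(0, 256, size=(100, 100, 3), dtype=np.uint8))

# resize takes the target (width, height)
demo_resized = demo_img.resize((50, 50))

# convert("L") produces a single-channel greyscale image
demo_grey = demo_img.convert("L")

# rotate with expand=True enlarges the canvas so the corners are not clipped
demo_rotated = demo_img.rotate(45, expand=True)

print(demo_resized.size, demo_grey.mode, demo_rotated.size)
```

Note that `expand` is an on/off flag rather than a number of pixels: any true value makes the output large enough to hold the whole rotated image.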
Most image formats have three color "channels": red, green, and blue (some images also have a fourth channel called "alpha" that controls transparency). For each pixel in an image, there is a value for every channel, so the data is represented as a three-dimensional array with the shape (height, width, channels).
img_data = np.array(img)
img_data_shape = img_data.shape
print("Our NumPy array has the shape: {}".format(img_data_shape))
plt.imshow(img_data)
plt.show()
# plot the red channel
plt.imshow(img_data[:,:,0], cmap=plt.cm.Reds_r)
plt.show()
# plot the green channel
plt.imshow(img_data[:,:,1], cmap=plt.cm.Greens_r)
plt.show()
# plot the blue channel
plt.imshow(img_data[:,:,2], cmap=plt.cm.Blues_r)
plt.show()
>>> Our NumPy array has the shape: (100, 100, 3)
Color channels can help provide more information about an image. This kind of information can be useful when building models or examining the differences between images.
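One simple way to turn channels into numbers is to summarize each one. The sketch below uses a made-up image (an assumption for illustration) that is mostly red, and shows that the per-channel means pick that up.

```python
import numpy as np

# Hypothetical 4x4 RGB image: strong red, medium green, no blue.
demo = np.zeros((4, 4, 3), dtype=np.uint8)
demo[:, :, 0] = 200
demo[:, :, 1] = 100

# Averaging over the height and width axes leaves one value per channel.
channel_means = demo.mean(axis=(0, 1))
for name, value in zip(["red", "green", "blue"], channel_means):
    print(name, value)
```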
Let's look at the kernel density estimate for each of the color channels on the same plot so that we can understand how they differ.
def plot_kde(channel, color):
    """ Plots a kernel density estimate for the given data.
    `channel` must be a 2d array
    `color` must be a color string, e.g. 'r', 'g', or 'b'
    """
    data = channel.flatten()
    return pd.Series(data).plot.density(c=color)
# create the list of channels
channels = ['r', 'g', 'b']
def plot_rgb(image_data):
    # use enumerate to loop over colors and indexes
    for ix, color in enumerate(channels):
        # plot the channel of the image that was passed in, not the global img_data
        plot_kde(image_data[:, :, ix], color)
    plt.show()
plot_rgb(img_data)
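A kernel density estimate is a smoothed histogram: it places a small Gaussian bump on every pixel value and averages the bumps. The sketch below is a hand-rolled illustration using only NumPy. The fixed bandwidth of 10 and the random sample data are assumptions, and this is not the exact routine pandas runs (pandas delegates to SciPy, which chooses the bandwidth automatically).

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    diffs = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * diffs ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

rng = np.random.default_rng(0)
# Stand-in for one flattened colour channel (values in 0..255)
pixels = rng.integers(0, 256, size=1000).astype(float)

# Grid with step 0.5, extended past 0 and 255 so the tails are included
grid = np.linspace(-100, 355, 911)
density = gaussian_kde_1d(pixels, grid, bandwidth=10.0)

# A density should integrate to roughly 1
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A larger bandwidth gives a smoother curve and a smaller one follows the data more closely, which is worth keeping in mind when comparing the shapes of two channels.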
Now we'll look at two different images and some of the differences between them.
honey = Image.open('datasets/bee_12.jpg')
bumble = Image.open('datasets/bee_3.jpg')
display(honey)
display(bumble)
honey_data = np.array(honey)
bumble_data = np.array(bumble)
plot_rgb(honey_data)
plot_rgb(bumble_data)
The colors of the flowers in the background may distract from what separates honey bees from bumble bees, so let's convert these images to black-and-white, or "grayscale." Because a grayscale image has a single channel instead of three, the shape of our array changes too: it goes from three dimensions to two.
# convert honey to grayscale
honey_bw = honey.convert("L")
# convert the image to a NumPy array
honey_bw_arr = np.array(honey_bw)
honey_bw_arr_shape = honey_bw_arr.shape
print("Our NumPy array has the shape: {}".format(honey_bw_arr_shape))
plt.imshow(honey_bw_arr, cmap=plt.cm.gray)
plt.show()
plot_kde(honey_bw_arr, 'k')
>>> Our NumPy array has the shape: (100, 100)
import os
import matplotlib as mpl
import matplotlib.pyplot as plt
from IPython.display import display
%matplotlib inline
import pandas as pd
import numpy as np
from PIL import Image
from skimage.feature import hog
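# rgb2grey was removed in scikit-image 0.19; alias rgb2gray so the name used below still works
from skimage.color import rgb2gray as rgb2grey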
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc, accuracy_score
Let's load our labels.csv file into a dataframe called labels, where the index is the image name (e.g. an index of 1036 refers to an image named 1036.jpg) and the genus column tells us the bee type. genus takes the value of either 0.0 (Apis or honey bee) or 1.0 (Bombus or bumble bee).
labels = pd.read_csv("datasets/labels.csv", index_col=0)
display(labels.head())
def get_image(row_id, root="datasets/"):
    """
    Converts an image number into the file path where the image is located,
    opens the image, and returns the image as a numpy array.
    """
    filename = "{}.jpg".format(row_id)
    file_path = os.path.join(root, filename)
    img = Image.open(file_path)
    return np.array(img)
# subset the dataframe to just Apis (genus is 0.0) and get the sixth item in the index
apis_row = labels[labels.genus == 0.0].index[5]
# show the corresponding image of an Apis
plt.imshow(get_image(apis_row))
plt.show()
# subset the dataframe to just Bombus (genus is 1.0) and get the sixth item in the index
bombus_row = labels[labels.genus == 1.0].index[5]
# show the corresponding image of a Bombus
plt.imshow(get_image(bombus_row))
plt.show()
id | genus |
---|---|
520 | 1.0 |
3800 | 1.0 |
3289 | 1.0 |
2695 | 1.0 |
4922 | 1.0 |
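The mask-then-index pattern used to pick `apis_row` and `bombus_row` can be tried on a tiny, made-up labels table. The ids and genus values below are hypothetical and are not taken from labels.csv.

```python
import pandas as pd

# Hypothetical miniature version of labels.csv
toy_labels = pd.DataFrame(
    {"genus": [1.0, 0.0, 1.0, 0.0, 0.0]},
    index=pd.Index([11, 12, 13, 14, 15], name="id"),
)

# The boolean mask keeps only the honey bee rows (genus == 0.0) ...
apis_only = toy_labels[toy_labels.genus == 0.0]

# ... and .index[n] returns the image id of the (n + 1)-th remaining row
second_apis_id = apis_only.index[1]
print(second_apis_id)
```

Because the index holds the image ids, the value returned here is exactly what `get_image` expects as its `row_id` argument.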
The rgb2grey function computes the luminance of an RGB image using the following formula: Y = 0.2125 R + 0.7154 G + 0.0721 B.
bombus = get_image(bombus_row)
grey_bombus = rgb2grey(bombus)
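To see what that formula does to individual pixels, here is a minimal NumPy sketch of the same weighted sum. It assumes 8-bit input scaled to the range 0..1, and the helper name `luminance` is made up for this illustration; it is not part of scikit-image.

```python
import numpy as np

def luminance(rgb):
    """Weighted sum of the R, G and B channels, scaled to the range 0..1."""
    weights = np.array([0.2125, 0.7154, 0.0721])
    return (rgb / 255.0) @ weights

# Hypothetical 1x3 image: one pure red, one pure green, one white pixel
pixels = np.array([[[255, 0, 0], [0, 255, 0], [255, 255, 255]]], dtype=np.uint8)
grey = luminance(pixels)

print(grey.shape)   # the channel axis is gone
print(grey)
```

The three weights sum to 1, so a white pixel stays at full brightness, while a pure green pixel comes out much brighter than a pure red one.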
The idea behind the histogram of oriented gradients (HOG) is that an object's shape within an image can be inferred from its edges, and a way to identify edges is by looking at the direction of intensity gradients (i.e. changes in luminance).
hog_features, hog_image = hog(grey_bombus,
                              visualize=True,
                              block_norm='L2-Hys',
                              pixels_per_cell=(16, 16))
# show our hog_image with a grey colormap
plt.imshow(hog_image, cmap=mpl.cm.gray)
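The core of HOG, a histogram of gradient directions inside one cell, can be sketched with NumPy alone. This is a simplified illustration on a made-up 16x16 cell, not scikit-image's implementation, which also groups cells into blocks and normalizes them (the `block_norm` argument above).

```python
import numpy as np

# Hypothetical 16x16 "cell" whose brightness increases from left to right
cell = np.tile(np.arange(16, dtype=float), (16, 1))

# gy is the change down the rows (axis 0), gx the change across the columns (axis 1)
gy, gx = np.gradient(cell)

magnitude = np.hypot(gx, gy)
# Unsigned orientation in degrees, folded into [0, 180)
orientation = np.degrees(np.arctan2(gy, gx)) % 180

# 9-bin orientation histogram, each pixel voting with its gradient magnitude
hist, bin_edges = np.histogram(orientation, bins=9, range=(0, 180), weights=magnitude)
print(hist)
```

All of the votes land in the first bin because every gradient in this cell points horizontally. A real photo spreads its votes across bins, and that spread is what describes the local shape.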
Most machine learning algorithms require data to be in a format where rows correspond to images and columns correspond to features. This means that all the information for a given image needs to be contained in a single row. We want to provide our model with the raw pixel values from our images as well as the HOG features we just calculated.
Let's generate a flattened features array for the bombus image.
def create_features(img):
    # flatten three channel color image
    color_features = img.flatten()
    # convert image to greyscale
    grey_image = rgb2grey(img)
    # get HOG features from greyscale image
    hog_features = hog(grey_image, block_norm='L2-Hys', pixels_per_cell=(16, 16))
    # combine color and hog features into a single array
    flat_features = np.hstack([color_features, hog_features])
    return flat_features
bombus_features = create_features(bombus)
# print shape of bombus_features
bombus_features.shape
>>> (31296,)
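The length of 31296 can be reproduced by hand. The sketch below assumes a 100x100 RGB image (consistent with the sizes printed earlier) and scikit-image's default hog settings of 9 orientations and 3x3 cells per block, which the call above does not override.

```python
# Assumptions: 100x100 RGB input and skimage's hog defaults of
# orientations=9 and cells_per_block=(3, 3)
height, width, n_channels = 100, 100, 3
pixels_per_cell = 16
cells_per_block = 3
orientations = 9

# 30000 raw pixel values
color_features = height * width * n_channels

# 6 whole cells fit along each side of the image
cells_per_side = height // pixels_per_cell
# a 3x3 block can slide to 4 positions along each side
blocks_per_side = cells_per_side - cells_per_block + 1
# 16 blocks, each holding 3 * 3 * 9 = 81 values
hog_length = blocks_per_side ** 2 * cells_per_block ** 2 * orientations

total_features = color_features + hog_length
print(total_features)  # 31296
```

Almost all of the features are raw pixel values; the HOG descriptor contributes only 1296 of them.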
Now it's time to loop over all of our images. We will create features for each image and then stack the flattened features arrays into a big matrix we can pass into our model.
def create_feature_matrix(label_dataframe):
    features_list = []
    for img_id in label_dataframe.index:
        # load image
        img = get_image(img_id)
        # get features for image
        image_features = create_features(img)
        features_list.append(image_features)
    # convert list of arrays into a matrix
    feature_matrix = np.array(features_list)
    return feature_matrix
feature_matrix = create_feature_matrix(labels)
feature_matrix.shape # rows correspond to images and columns to features.
>>> (500, 31296)
Many machine learning methods are built to work best with data that has a mean of 0 and unit variance. Luckily, scikit-learn provides the StandardScaler class, which rescales every feature in exactly this way.
Also note that we have over 31,000 features for each image and only 500 images in total. To use an SVM, our model, we also need to reduce the number of features using principal component analysis (PCA).
print('Feature matrix shape is: ', feature_matrix.shape)
scaler = StandardScaler()
bees_stand = scaler.fit_transform(feature_matrix)
pca = PCA(n_components=500)
# use fit_transform to run PCA on our standardized matrix
bees_pca = pca.fit_transform(bees_stand)
print('PCA matrix shape is: ', bees_pca.shape)
>>> Feature matrix shape is: (500, 31296)
>>> PCA matrix shape is: (500, 500)
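To make the two steps less of a black box, here is a minimal NumPy sketch of standardization followed by PCA via the singular value decomposition. The small random matrix and the choice of 5 components are assumptions for illustration, and scikit-learn's PCA may use a different solver and sign convention internally.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical small feature matrix: 20 "images", 50 features each
X = rng.normal(loc=5.0, scale=3.0, size=(20, 50))

# Standardize: subtract each column's mean and divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD of the centred data: the rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)
n_components = 5
X_reduced = X_std @ Vt[:n_components].T

# Share of the total variance kept by the first n_components directions
explained = (S[:n_components] ** 2).sum() / (S ** 2).sum()
print(X_reduced.shape, round(explained, 3))
```

The number of rows is unchanged, since PCA only reduces the number of columns. The explained-variance share is a useful check on how much information a given number of components throws away.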
We'll use 70% of images as our training data and test our model on the remaining 30%.
X_train, X_test, y_train, y_test = train_test_split(bees_pca,
                                                    labels.genus.values,
                                                    test_size=.3,
                                                    random_state=1234123)
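Conceptually, the split shuffles the row indices and cuts them at the 30% mark. The sketch below illustrates that idea with NumPy; it will not reproduce the exact rows that train_test_split selects, because the shuffling differs.

```python
import numpy as np

n_images = 500          # number of rows in the feature matrix above
test_fraction = 0.3

# Seed chosen only to make this sketch repeatable
rng = np.random.default_rng(1234123)
shuffled = rng.permutation(n_images)

# The first 30% of the shuffled indices become the test set
n_test = int(np.ceil(test_fraction * n_images))
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

print(len(train_idx), len(test_idx))
```

With 500 images this leaves 350 for training and 150 for testing, and no image appears in both sets.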
We'll use a support vector machine (SVM), a type of supervised machine learning model. Since we have a classification task -- honey or bumble bee -- we will use the support vector classifier (SVC).
# define support vector classifier
svm = SVC(kernel='linear', probability=True, random_state=42)
# fit model
svm.fit(X_train, y_train)
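A linear SVC looks for a straight boundary that separates the two classes with the widest possible margin. This is easiest to see on a toy problem. The six 2-D points below are made up for illustration and have nothing to do with the bee features.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D toy data: class 0.0 near the origin, class 1.0 near (5, 5)
X_toy = np.array([[0.0, 0.0], [0.5, 1.0], [1.0, 0.5],
                  [5.0, 5.0], [4.5, 6.0], [6.0, 4.5]])
y_toy = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

toy_svm = SVC(kernel='linear')
toy_svm.fit(X_toy, y_toy)

# Points on either side of the separating line get different labels
toy_pred = toy_svm.predict(np.array([[0.2, 0.2], [5.5, 5.5]]))
print(toy_pred)
```

The bee data is far less cleanly separated than this, which is why the accuracy reported below is well short of perfect.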
And finally, let's see how accurate our model is. Accuracy is the number of correct predictions divided by the total number of predictions.
y_pred = svm.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
accuracy
>>> 0.68
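The same calculation can be written out by hand. The labels below are hypothetical and serve only to show the arithmetic behind accuracy_score.

```python
import numpy as np

# Hypothetical true and predicted labels for eight images
y_true_demo = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# Accuracy is the fraction of positions where the two arrays agree
demo_accuracy = np.mean(y_true_demo == y_pred_demo)
print(demo_accuracy)  # 6 of 8 correct: 0.75
```

Read the same way, an accuracy of 0.68 on a test set of 150 images (30% of 500) means 102 of them were labeled correctly.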
The receiver operating characteristic curve (ROC curve) plots the false positive rate against the true positive rate at different classification thresholds. ROC curves are judged visually by how close they come to the upper left-hand corner.
The area under the curve (AUC) summarizes the curve in a single number: 1 means the predicted probabilities separate the two classes perfectly, while 0.5 is what random guessing would achieve.
probabilities = svm.predict_proba(X_test)
# select the probabilities for label 1.0
y_proba = probabilities[:, 1]
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_proba, pos_label=1)
roc_auc = auc(false_positive_rate, true_positive_rate)
plt.title('Receiver Operating Characteristic')
roc_plot = plt.plot(false_positive_rate,
                    true_positive_rate,
                    label='AUC = {:0.2f}'.format(roc_auc))
plt.legend(loc=0)
plt.plot([0,1], [0,1], ls='--')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate');
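Another way to read the AUC is as the probability that a randomly chosen bumble bee (label 1) receives a higher predicted probability than a randomly chosen honey bee (label 0). The sketch below computes it that way, by counting pairs. The helper `pairwise_auc` and the four example scores are made up for illustration; this is not how scikit-learn computes the value, although the two approaches agree.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities for four images
demo_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(demo_auc)  # 3 of the 4 pairs are ranked correctly: 0.75
```

Unlike accuracy, this number does not depend on any single threshold, which makes it a useful companion to the accuracy score above.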