Automated species identification with Python
Using machine learning to automate species recognition.
July 6, 2022
The code and data for this project are available on GitHub.
Introduction
During my PhD on ecosystem stability, I followed the dynamics of microcosms – artificial, simplified ecosystems made of vials and tubes and inhabited by different species of unicellular organisms. This entailed identifying and counting hundreds of thousands of individuals in more than 250 vials at regular intervals over the course of a month.
Thankfully, automating the identification process allowed me to tackle this otherwise daunting task. I regularly took video samples, which I analyzed with the Bemovi R package: it tracks individuals (moving particles) and extracts information on them such as their size, shape and trajectory properties. Bemovi outputs a dataframe where each row is a single individual and each column one of its properties – a format well suited to classification methods. I could thus train a machine learning algorithm on monospecific cultures (videos with only one species, so we are sure about the species identity) and then use it to predict species identity in plurispecific cultures.
At the time, I automated species recognition in R by slapping a Random Forest with default parameters on my data, checking visually that it worked most of the time, and calling it a day.
Now that I know better, I would like to come back to this data and do a more thorough job: benchmarking different ML algorithms and estimating accuracy by splitting the data into a training set and a testing set. I will also assess the robustness of each method by checking its accuracy on datasets recorded at later dates, and by estimating accuracy on a dataset with added noise. In this post, I will guide you through the automation process, starting from the output of Bemovi (a dataframe describing the properties of each tracked particle), using Python and scikit-learn.
Data importation and preparation
Let us first import a few libraries and load our training data set:
## library imports
# data manipulation and maths
import pandas as pd
import numpy as np
# plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

## loading the data and inspecting its structure
data = pd.read_csv('datasets/morph_data_t0.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15885 entries, 0 to 15884
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 file 15885 non-null object
1 mean_grey 15885 non-null float64
2 sd_grey 15885 non-null float64
3 mean_area 15885 non-null float64
4 sd_area 15885 non-null float64
5 mean_perimeter 15885 non-null float64
6 sd_perimeter 15885 non-null float64
7 mean_major 15885 non-null float64
8 sd_major 15885 non-null float64
9 mean_minor 15885 non-null float64
10 sd_minor 15885 non-null float64
11 mean_ar 15885 non-null float64
12 sd_ar 15885 non-null float64
13 mean_turning 15885 non-null float64
14 sd_turning 15885 non-null float64
15 duration 15885 non-null float64
16 N_frames 15885 non-null int64
17 max_net 15885 non-null float64
18 net_disp 15885 non-null int64
19 net_speed 15885 non-null float64
20 gross_disp 15885 non-null float64
21 gross_speed 15885 non-null float64
22 max_step 15885 non-null float64
23 min_step 15885 non-null float64
24 sd_step 15885 non-null float64
25 sd_gross_speed 15885 non-null float64
26 max_gross_speed 15885 non-null float64
27 min_gross_speed 15885 non-null float64
28 id 15885 non-null object
29 date 15885 non-null object
30 time 15885 non-null object
31 magnification 15885 non-null int64
32 tube_id 15885 non-null int64
33 measure_point 15885 non-null object
34 community 15885 non-null object
dtypes: float64(25), int64(4), object(6)
memory usage: 4.2+ MB
This dataframe was obtained with the Bemovi package, which tracks individuals from video files. It outputs a dataframe where each row describes the properties of a tracked particle (here, a unicellular organism) such as its area, size or color. Our target variable is contained in the ‘community’ column (but we’ll see later that we also need to transform this variable).
Some of the columns are metadata that will not be useful for the identification:
- ‘file’: the name of the video file from which the individual was tracked.
- ‘id’: a unique number identifying the individual.
- ‘date’, ‘time’: the date and time of the video recording.
- ‘magnification’: microscope setting. Not informative since we used the same magnification for the whole experiment.
- ‘tube_id’: a unique number identifying the Falcon tube from which the sample was taken.
- ‘measure_point’: a categorical variable describing the order of measurements.
Let’s get rid of these columns:
# dropping metadata columns that are useless for identification:
data.drop(['file',
           'id',
           'date',
           'time',
           'magnification',
           'tube_id',
           'measure_point'],
          axis = 1, inplace = True)
A first thing we can do is take a look at how the variables relate to each other and to the target variable. Because we have a lot of variables (28 columns) and observations (15885 rows), we’ll only take a look at a few variables on a subset of the data (1000 rows). Using a pairplot, let’s take a look at the following variables of interest, grouped by species id:
- ‘mean_grey’: average grey level of an individual.
- ‘mean_area’: average area of an individual.
- ‘mean_ar’: mean aspect ratio (length relative to width), describing the individual’s shape.
- ‘duration’: time during which an individual was tracked. Small and fast individuals tend to get ‘lost’ by the tracking algorithm and have a shorter duration.
- ‘gross_speed’: speed of the individual.
# sampling part of the data and looking at a pairplot for a few variables of interest
data_sample = data[['mean_grey',   # particle color
                    'mean_area',   # particle area
                    'mean_ar',     # particle elongation (aspect ratio)
                    'duration',
                    'gross_speed',
                    'community']].sample(1000)
sns.pairplot(data_sample, hue = 'community')
We can already see that the species are well separated by some variables such as their size (‘mean_area’) and speed (‘gross_speed’).
It is also worth noting that the variables are on very different scales (e.g., mean_area: 0-35000 vs. mean_grey: 100-250) so we’ll need to rescale them. We also note quite a lot of outliers, so we’ll want to make sure that we use a scaler that is not sensitive to outliers.
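To see why this matters, here is a toy example of mine (not part of the original analysis): scikit-learn’s RobustScaler centers on the median and scales by the interquartile range, so a single extreme value barely distorts the transformed data, whereas MinMaxScaler squashes all the ordinary values towards zero:

# toy illustration of outlier sensitivity (not part of the original analysis)
from sklearn.preprocessing import MinMaxScaler, RobustScaler
import numpy as np
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # one extreme outlier
print(MinMaxScaler().fit_transform(x)[:4].ravel())    # ordinary values squashed near 0
print(RobustScaler().fit_transform(x)[:4].ravel())    # ordinary values keep their spread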
Lastly, we see that Blepharisma sp. are a bit underrepresented, so it’s hard to make them out. Let’s look at them separately:
# extracting all individuals labelled 'Ble' and looking at the same variables of interest
data_sample_ble = data[data.community == 'Ble'][['mean_grey', 'mean_area', 'mean_ar',
                                                 'duration', 'gross_speed',
                                                 'community']]
sns.pairplot(data = data_sample_ble, hue = 'community')
As you can see, the color and size of Blepharisma are bimodal. Taking a look at the videos, you can see that Blepharisma sp. are so big that they cast a reflection on the bottom of the microscope slide, which gets picked up by the tracking algorithm. Originally, I arbitrarily took all Blepharisma sp. with a mean_grey smaller than 180 and gave them the new label ‘Ble_shadow’. Here, we’ll relabel them using a clustering algorithm instead, to be more robust (i.e., if next week we change the microscope’s light bulb, the grey threshold might no longer be 180 for future videos). We do that by running a clustering algorithm on the grey level and area of the particles labelled ‘Ble’:
# importing a scaler and clustering algorithm
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import AgglomerativeClustering
# extracting Blepharisma's area and color
X_ble = data[data.community == 'Ble'][['mean_grey', 'mean_area']]
# scaling it
X_ble_scaled = MinMaxScaler().fit_transform(X_ble)
# running an agglomerative clustering on it and storing the cluster id in data_sample_ble
model_ble = AgglomerativeClustering(n_clusters = 2, linkage = 'complete')
data_sample_ble['cluster'] = model_ble.fit_predict(X_ble_scaled)
# Visually checking the new clusters:
sns.pairplot(data_sample_ble[['mean_grey', 'mean_area', 'cluster']], hue = 'cluster')
This seems to work quite well! All that’s left to do is to associate cluster 1 (the smallest individuals) with the label ‘Ble_shadow’ and then port that to the main data in a new column that we’ll call ‘species’:
# associating each cluster with the right label:
# the lambda maps cluster 1 to 'Ble_shadow' (x*'Ble_shadow') and cluster 0 to 'Ble' ((x-1)**2 * 'Ble')
data_sample_ble['species'] = data_sample_ble.cluster.apply(lambda x: x*'Ble_shadow' + (x-1)**2 * 'Ble')
# Putting that to the full data frame
data['species'] = data.community
data.loc[data.community == 'Ble', 'species'] = data_sample_ble['species']
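One caveat worth flagging: the numeric ids returned by AgglomerativeClustering are arbitrary, so nothing guarantees the shadows will land in cluster 1 on another dataset. A more explicit variant (a sketch of mine, not what I originally used) picks the shadow cluster as the one with the smaller average area, whatever its id:

# hypothetical, more robust labelling: the shadow cluster is the one with the smaller mean area
ble = data[data.community == 'Ble'].copy()
ble['cluster'] = model_ble.fit_predict(X_ble_scaled)
shadow_cluster = ble.groupby('cluster')['mean_area'].mean().idxmin()
ble['species'] = np.where(ble.cluster == shadow_cluster, 'Ble_shadow', 'Ble')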
Finally, we can drop the column ‘community’ and split our data between a training and a testing set, and we’ll be ready to fit!
# separating features from labels
y = data.species.values
X = data.drop(['community', 'species'], axis = 1).values
# train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
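Since Blepharisma is underrepresented, it may also be worth stratifying the split so that both sets keep the same class proportions (a suggestion of mine, not used in the original analysis):

# alternative split (not used in the results below): stratify preserves class proportions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
                                                    random_state = 1, stratify = y)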
Model fitting
First let’s import some libraries:
## Utility functions
from sklearn.pipeline import Pipeline
## parameter tuning and model validation
from sklearn.model_selection import train_test_split, GridSearchCV
## feature scaling
from sklearn.preprocessing import RobustScaler
## models
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import RidgeClassifier
# to time the fit of each model
import time
from datetime import timedelta
Now we declare a list of models and parameter grids to explore for each:
names = ['knn', 'random_forest',
         'neuralnet', 'SVC', 'ridge']
classifiers = [KNeighborsClassifier(), RandomForestClassifier(),
               MLPClassifier(max_iter = 500), LinearSVC(), RidgeClassifier()]
params = [{'knn__n_neighbors': range(1, 10)},
          {'random_forest__n_estimators': [50, 100, 200]},
          {'neuralnet__hidden_layer_sizes': [(80,), (40, 10), (30, 10, 5)]},
          {'SVC__C': [0.5, 1.0, 2.0]},
          {'ridge__alpha': [0.5, 1.0, 2.0, 5.0, 10.0]}]
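The double underscores in the grid keys follow scikit-learn’s ‘step__parameter’ convention for addressing a parameter of a named pipeline step. As a quick illustration (not part of the original analysis), this is equivalent to what GridSearchCV does internally when it tries a candidate value:

# 'knn__n_neighbors' targets the n_neighbors parameter of the pipeline step named 'knn'
pipe = Pipeline([('scaler', RobustScaler()), ('knn', KNeighborsClassifier())])
pipe.set_params(knn__n_neighbors = 5)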
Lastly, we fit each model to the training data and extract the accuracy score on the test data:
model_cv = []
scores_list = []
times = []
for k in range(len(names)):
    # wrapping each model in a pipeline with a robust scaler, then tuning it by 5-fold grid search
    model_cv.append(GridSearchCV(Pipeline([('scaler', RobustScaler()), (names[k], classifiers[k])]),
                                 params[k], cv = 5))
    start_time = time.monotonic()
    model_cv[k].fit(X_train, y_train)
    end_time = time.monotonic()
    scores_list.append(model_cv[k].score(X_test, y_test))
    times.append(timedelta(seconds = end_time - start_time))
/usr/local/lib/python3.8/dist-packages/sklearn/svm/_base.py:1206: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  warnings.warn(
(this warning is repeated for each LinearSVC fit during the grid search)
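Before comparing scores, it can be worth peeking at the hyperparameters each grid search settled on (a quick check of mine, not shown in the original post):

# inspecting the winning hyperparameters of each grid search
for name, m in zip(names, model_cv):
    print(name, m.best_params_)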
# summarising test accuracy, best cross-validated accuracy ('train_score') and fitting time (in seconds)
scores = pd.DataFrame({'model': names,
                       'test_score': scores_list,
                       'train_score': [m.best_score_ for m in model_cv],
                       'time': [t.seconds for t in times]})
scores.sort_values('test_score', inplace = True, ascending = False)
Comparing the accuracy of different models, testing on the left and training on the right:
fig, ax = plt.subplots(figsize=(8, 5), ncols=2, sharey=True)
#fig.tight_layout()
sns.barplot(ax = ax[0], x="test_score", y="model", data=scores, hue = 'test_score', dodge=False, palette = 'flare')
ax[0].legend_.remove()
ax[0].set_xlim(0.98, 1)
ax[0].set_title('Testing accuracy')
ax[0].invert_xaxis()
ax[0].set(ylabel = 'Model', xlabel = '')
sns.barplot(ax=ax[1], x="train_score", y="model", data=scores, hue = 'train_score', dodge=False, palette = 'crest')
ax[1].legend_.remove()
ax[1].set_xlim(0.98, 1)
ax[1].set_title('Training accuracy')
ax[1].set(ylabel = '', xlabel = '')
ax[1].yaxis.tick_right()
ax[1].set_xticks(ax[1].get_xticks()[1:])
plt.subplots_adjust(wspace=0)
Checking the runtime of each model (right):
fig, ax = plt.subplots(figsize=(8, 5), ncols=2, sharey=True)
#fig.tight_layout()
sns.barplot(ax = ax[0], x="test_score", y="model", data=scores, hue = 'test_score', dodge=False, palette = 'flare')
ax[0].legend_.remove()
ax[0].set_xlim(0.98, 1)
ax[0].set_title('Testing accuracy')
ax[0].invert_xaxis()
ax[0].set(ylabel = 'Model', xlabel = '')
ax[0].set_xticks(ax[0].get_xticks()[1:])
sns.barplot(ax=ax[1], x="time", y="model", data=scores, hue = 'time', dodge=False, palette = 'crest')
ax[1].legend_.remove()
ax[1].set_title('Fitting time')
ax[1].set(ylabel = '', xlabel = '')
ax[1].yaxis.tick_right()
ax[1].set_xticks(ax[1].get_xticks()[1:])
plt.subplots_adjust(wspace=0)
All models do quite well, with more than 99% accuracy on the test set. SVC is very efficient given its fast training time, but I’m a bit worried by the convergence warnings. Also consider that the models with few parameters (SVC, knn and ridge) are probably close to their best performance already, while the neural net and random forest could probably be improved by fine-tuning their hyperparameters.
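If the convergence warnings are a concern, two common remedies (suggestions of mine, untested here) are to give liblinear more iterations or to solve the primal problem, which often converges faster when there are many more samples than features:

# possible fixes for the liblinear convergence warnings (untested here)
svc_more_iterations = LinearSVC(max_iter = 10000)  # default is 1000
svc_primal = LinearSVC(dual = False)               # primal problem; suits n_samples >> n_features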
If a 1% error is not an issue for you, I would recommend knn or ridge, as they are already very accurate while being super easy and fast to tune. If very high accuracy matters, I would recommend a neural net and taking some more time to fine-tune its parameters.
However, I’d like to run two final tests on tricky data, to check that the models generalize well and to get an idea of the worst performance we can expect.
Model assessment on tricky data
I will check the robustness of the different models on two kinds of tricky test data:
- data taken on other days, as unicellular organisms are fairly plastic and their traits can vary over time.
- noisy data, obtained by applying some white noise to the individuals’ traits.
Automating data importation
In order to test the models on more data, I first declare a function that automates the data processing (dropping useless columns, classifying Ble/Ble_shadow):
def clean_data(data):
    # dropping useless columns for id:
    data.drop(['file',            # name of the video file the particle comes from
               'id',              # id of the particle
               'date',            # date of the video
               'time',            # hour of the video
               'magnification',   # microscope setting
               'tube_id',         # id of the tube the sample comes from
               'measure_point'],  # a categorical variable describing which measurement of the experiment it was
              axis = 1, inplace = True)
    ## identifying Ble_shadows
    # extracting Blepharisma's area and color
    X_ble = data[data.community == 'Ble'][['mean_grey', 'mean_area']]
    # scaling it
    X_ble_scaled = MinMaxScaler().fit_transform(X_ble)
    # running an agglomerative clustering on it
    model_ble = AgglomerativeClustering(n_clusters = 2, linkage = 'complete')
    # keeping the original row index so the labels align with the right rows of data
    cluster = pd.Series(model_ble.fit_predict(X_ble_scaled), index = X_ble.index)
    # associating each cluster with the right label (1 -> 'Ble_shadow', 0 -> 'Ble')
    species = cluster.apply(lambda x: x*'Ble_shadow' + (x-1)**2 * 'Ble')
    # writing that to the full data frame
    data['species'] = data.community
    data.loc[data.community == 'Ble', 'species'] = species
    return None
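A quick sanity check on one of the later files (assuming it has the same layout as the t0 file) confirms that the function works in place:

# hypothetical sanity check: process one file and look at the label counts
check = pd.read_csv('datasets/morph_data_t1.csv')
clean_data(check)
print(check.species.value_counts())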
Fitting the models to all data points
The data I used to fit the model was taken on the first day of the experiment (‘datasets/morph_data_t0.csv’), but we also have 11 more data sets (t1 to t11) taken in the following days. I will now test the models on this data to assess their robustness:
# Creating a dataframe that will store the accuracy of each model on each data set
temp_df = pd.DataFrame({'model': names})
## looping over all measurement times
# (the loop variable is named t rather than time, to avoid shadowing the time module)
for t in range(1, 12):
    file = 'datasets/morph_data_t' + str(t) + '.csv'
    temp_data = pd.read_csv(file)
    clean_data(temp_data)
    # separating features from labels
    y_temp = temp_data.species.values
    X_temp = temp_data.drop(['community', 'species'], axis = 1).values
    ## scoring each model at the given time
    t_score = []
    for model_id in range(len(names)):
        t_score.append(model_cv[model_id].score(X_temp, y_temp))
    temp_df['t_' + str(t)] = t_score

temp_df
| | model | t_1 | t_2 | t_3 | t_4 | t_5 | t_6 | t_7 | t_8 | t_9 | t_10 | t_11 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | knn | 0.987349 | 0.984904 | 0.985214 | 0.946599 | 0.957288 | 0.987789 | 0.976901 | 0.965896 | 0.974500 | 0.936820 | 0.970832 |
| 1 | random_forest | 0.990536 | 0.991766 | 0.991418 | 0.955245 | 0.962504 | 0.989417 | 0.986051 | 0.985130 | 0.984590 | 0.948536 | 0.978935 |
| 2 | neuralnet | 0.992854 | 0.993666 | 0.994520 | 0.955372 | 0.963913 | 0.991859 | 0.986351 | 0.974947 | 0.986975 | 0.950209 | 0.981568 |
| 3 | SVC | 0.991695 | 0.989760 | 0.993072 | 0.952956 | 0.964336 | 0.991534 | 0.980501 | 0.965896 | 0.984223 | 0.948117 | 0.978935 |
| 4 | ridge | 0.988218 | 0.984904 | 0.988109 | 0.953592 | 0.959825 | 0.981276 | 0.973751 | 0.977534 | 0.984407 | 0.950209 | 0.975289 |
We can now aggregate this dataframe to get the average accuracy over all data points:
temp_df_long = temp_df.melt(id_vars='model', value_vars=['t_' + str(k) for k in range(1,12)], var_name = 'Measurement', value_name = 'Accuracy')
mean_sd_scores = temp_df_long.groupby('model').agg(Accuracy_mean = ('Accuracy', 'mean'), Accuracy_std = ('Accuracy', 'std'))
mean_sd_scores.reset_index(inplace = True)
scores = scores.merge(mean_sd_scores, on = 'model')
scores
| | model | test_score | train_score | time | Accuracy_mean | Accuracy_std |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | neuralnet | 0.996852 | 0.997718 | 78 | 0.979294 | 0.016024 |
| 1 | SVC | 0.996223 | 0.997718 | 4 | 0.976457 | 0.016153 |
| 2 | knn | 0.994649 | 0.994098 | 40 | 0.970372 | 0.017219 |
| 3 | random_forest | 0.994649 | 0.996695 | 63 | 0.978557 | 0.015619 |
| 4 | ridge | 0.992446 | 0.992996 | 3 | 0.974283 | 0.013704 |
Let us plot the average accuracy of the different models:
fig, ax = plt.subplots(figsize=(8, 5), ncols=2, sharey=True)
#fig.tight_layout()
sns.barplot(ax = ax[0], x="test_score", y="model", data=scores, hue = 'test_score', dodge=False, palette = 'flare')
ax[0].legend_.remove()
ax[0].set_xlim(0.95, 1)
ax[0].set_title('Testing accuracy')
ax[0].invert_xaxis()
ax[0].set(ylabel = 'Model', xlabel = '')
sns.barplot(ax=ax[1], x="Accuracy_mean", y="model", data=scores, hue = 'Accuracy_mean', dodge=False, palette = 'crest')
ax[1].legend_.remove()
ax[1].set_xlim(0.95, 1)
ax[1].set_title('Testing accuracy for t > 0')
ax[1].set(ylabel = '', xlabel = '')
ax[1].yaxis.tick_right()
ax[1].set_xticks(ax[1].get_xticks()[1:])
plt.subplots_adjust(wspace=0)
Fitting the models to noisy data
Lastly, I want to artificially create tricky data by applying some white noise to the measurements. I first define a function that returns noisy data from clean data (each value is multiplied by a Gaussian factor of mean 1 and standard deviation sd), and then test the models over a range of noise amplitudes (sd).
# defining a function that adds multiplicative noise to data
def add_noise(X, sd):
    # each entry of X is multiplied by a Gaussian factor of mean 1 and standard deviation sd
    noise_array = np.random.normal(1, sd, X.shape)
    return X * noise_array
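Since the noise is random, the exact numbers will differ between runs; seeding NumPy’s generator first makes them reproducible (my addition, not in the original run):

# optional: seed the global NumPy RNG so the noise draws are reproducible
np.random.seed(42)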
# defining the range of noise amplitudes we want to explore
sd_space = np.linspace(0, 0.3, 6)
# creating some lists to store the model names, noise amplitudes, and accuracies
temp_models = []
sd_list = []
accuracy = []
# looping over several replicates and noise amplitudes
for replicate in range(20):
    for sd in sd_space:
        X_train_noisy = add_noise(X_train, sd)
        X_test_noisy = add_noise(X_test, sd)
        ## refitting each tuned model on the noisy training data and scoring it on the noisy test data
        for model_id in range(len(names)):
            model = model_cv[model_id].best_estimator_
            model.fit(X_train_noisy, y_train)
            accuracy.append(model.score(X_test_noisy, y_test))
            temp_models.append(names[model_id])
            sd_list.append(sd)

noise_df_long = pd.DataFrame({'model': temp_models, 'Sd': sd_list, 'Accuracy': accuracy})
noise_df_long.head()
| | model | Sd | Accuracy |
| --- | --- | --- | --- |
| 0 | knn | 0.0 | 0.994649 |
| 1 | random_forest | 0.0 | 0.995279 |
| 2 | neuralnet | 0.0 | 0.996223 |
| 3 | SVC | 0.0 | 0.996223 |
| 4 | ridge | 0.0 | 0.992446 |
Finally, we can plot the accuracy of the different models against the noise amplitude:
fig, ax = plt.subplots(figsize=(12, 8))
sns.lineplot(ax = ax, data = noise_df_long,
             x = 'Sd', y = 'Accuracy',
             ci = 100,
             hue = 'model', style = 'model',
             palette = 'colorblind', linewidth = 3,
             markers = True, dashes = False)
ax.set_title('Accuracy of different models over increasing measurement noise')
ax.set(ylabel = 'Accuracy', xlabel = 'Noise amplitude')
Conclusion
All models are quite robust, with accuracy never falling below 96%. The neural net seems to be the most robust, reaching a higher accuracy than the other models on tricky data despite the rough tuning of its hyperparameters. As such, it would probably be my go-to model for future, similar species identification tasks.