# Introduction

There are countless examples in the literature where deep learning is used to make predictions using protein sequences. In order to use protein sequences as inputs for deep learning models they need to be transformed into numerical descriptors. This can be performed in several different ways depending on the underlying data and the purpose of the model. In this notebook I highlight four different ways to generate such numerical features from protein sequences using; 1) the Amino Acid composition, 2) generation of N-grams, 3) creation of word embeddings and 4) BLOSUM matrices. To illustrate how to use these descriptors I use them to train a deep learning model to predict protein classifications. The models themselves were kept as basic as possible and although reaching good accuracy were not fully optimized.

# Libraries

The below libraries were used to process the data and build the models.

# import the necessary packages
from collections import Counter
from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split, KFold
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, LabelBinarizer, OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn import preprocessing
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout, Embedding, Flatten, Concatenate, Input, Conv1D, MaxPooling1D, LeakyReLU, LSTM, BatchNormalization
from keras.losses import mean_squared_error
from sklearn.metrics import r2_score, accuracy_score
from keras import backend as K
from keras.utils.np_utils import to_categorical
from sklearn import linear_model
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import cv2
np.random.seed(1337) # for reproducibility


# Data

The dataset was compiled form the Protein Data Bank and posted by SHAHIR on kaggle for people to use to test different modeling approaches. The PDB database is used to visualize and store three-dimensional structural data of proteins. Each structure comes with a comprehensive list of all the experimental details that was used to generate the structure and the characteristics of the protein. It also classifies the protein into a family based on structural and sequence similarity with other proteins.

## Processing

Before using the data several steps were performed to make it more presentable for our deep learning model which is outlined below.

The data consists of two separate files. One file containing all the details about the structure and a second file containing the details on the protein sequence. Both files are loaded and merged together based on Structure ID. The file also contains structural data from DNA and RNA. These will be removed as we are only interested in the protein structures. We are only interested in the protein IDs and the sequences so we drop all other columns.

pdb_data_no_dups = pd.read_csv("Data/pdb_data_no_dups.csv")
#pdb_data_no_dups = pdb_data_no_dups[[""]]
#Merge data on structureId and select only protein structures
data_merged = pdb_data_no_dups.merge(pdb_data_seq, how='inner', on='structureId').query('macromoleculeType_x == "Protein"')
#Select columns we are interested in and remove duplicates
data_merged = data_merged[["structureId", "classification", "sequence"]]
data_merged = data_merged.drop_duplicates(["structureId", "classification", "sequence"])
data_merged = data_merged.reset_index(drop=True)
#Print data
'The data has {} rows and {} columns'.format(data_merged.shape[0], data_merged.shape[1])

## 'The data has 166681 rows and 3 columns'

pd.set_option('display.max_columns', None)
data_merged

##        structureId                            classification  \
## 0             101M                          OXYGEN TRANSPORT
## 1             102L                     HYDROLASE(O-GLYCOSYL)
## 2             102M                          OXYGEN TRANSPORT
## 3             103L                     HYDROLASE(O-GLYCOSYL)
## 4             103M                          OXYGEN TRANSPORT
## ...            ...                                       ...
## 166676        9RSA            HYDROLASE (PHOSPHORIC DIESTER)
## 166677        9RUB                      LYASE(CARBON-CARBON)
## 166678        9WGA                       LECTIN (AGGLUTININ)
## 166679        9XIA  ISOMERASE(INTRAMOLECULAR OXIDOREDUCTASE)
## 166680        9XIM  ISOMERASE(INTRAMOLECULAR OXIDOREDUCTASE)
##
##                                                  sequence
## 1       MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSE...
## 3       MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAK...
## ...                                                   ...
## 166676  KETAAAKFERQHMDSSTSAASSSNYCNQMMKSRNLTKDRCKPVNTF...
## 166677  MDQSSRYVNLALKEEDLIAGGEHVLCAYIMKPKAGYGYVATAAHFA...
## 166678  ERCGEQGSNMECPNNLCCSQYGYCGMGGDYCGKGCQNGACWTSKRC...
## 166679  MNYQPTPEDRFTFGLWTVGWQGRDPFGDATRRALDPVESVQRLAEL...
## 166680  SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG...
##
## [166681 rows x 3 columns]


Rows with missing values are removed

data_merged.isna().sum()

## structureId       0
## classification    1
## sequence          3
## dtype: int64

data_merged = data_merged.dropna()


Having a closer look at the data one can see that some of the sequences contain AAs other then the 21 standard ones (B|O|U|X|Z).

#Count all different AAs
sequences = data_merged["sequence"].dropna()
charCounter = CountVectorizer(analyzer="char", ngram_range=(1,1))
charVec = charCounter.fit_transform(sequences.to_numpy())
AA = charCounter.get_feature_names_out()
AAcounts = np.asarray(charVec.sum(axis=0))[0]
print(dict(zip(AA, AAcounts)))

## {'a': 3378009, 'b': 19, 'c': 642515, 'd': 2484519, 'e': 2835673, 'f': 1746745, 'g': 3211970, 'h': 1176659, 'i': 2389749, 'k': 2517051, 'l': 3940616, 'm': 1001217, 'n': 1857711, 'o': 2, 'p': 2048893, 'q': 1659623, 'r': 2136127, 's': 2837198, 't': 2460573, 'u': 50, 'v': 2990099, 'w': 628670, 'x': 177570, 'y': 1546907, 'z': 28}


For the sake of simplicity I decided to remove all those sequences.

#Remove sequences that contain non standard AAs
AA_mask = sequences.str.contains('^[RHKDESTNQCGPAILMFWYV]+') data_merged = data_merged.loc[AA_mask] '{} rows were removed'.format((AA_mask == False).sum())  ## '4377 rows were removed'  If we look at the size distribution of the sequences we can see that quite a large proportion consists of sequences shorter then 20 AAs. seqLengths = data_merged["sequence"].str.len() lengths = Counter(seqLengths.to_list()) plt.clf() # visualize _ = plt.hist(seqLengths.to_list(), bins=120, edgecolor='black') _ = plt.xlim(0, 1200) plt.show()  These are most likely from proteins for which only part of the structure has been solved and it will be hard to predict an accurate classification for those. To get more accurate models I removed all sequences that have a length with less then 20 AAs. #Filter out sequences that are shorter then 20 AAs AA_length_mask = data_merged["sequence"].str.len() >= 20 data_merged = data_merged.loc[AA_length_mask]  There are >4000 different classifications defined. Most of them consisting of far to little sequences to train a model. data_merged.groupby(["classification"])["classification"] \ .count() \ .sort_values(ascending = False)  ## classification ## HYDROLASE 23465 ## TRANSFERASE 16557 ## OXIDOREDUCTASE 14288 ## IMMUNE SYSTEM 7920 ## HYDROLASE/HYDROLASE INHIBITOR 5661 ## ... ## NOVEL SEQUENCE 1 ## NONSTRUCTURAL PROTEIN 1 ## NON-HEME IRON PROTEIN 1 ## NK CELL 1 ## virus/inhibitor 1 ## Name: classification, Length: 4383, dtype: int64  To build the models I focus on the top 10 classifications. #Filter for top10 classifications classes = 10 data_merged['count'] = data_merged.groupby(["classification"])["classification"].transform('count') top_class = data_merged["count"].sort_values(ascending=False).unique()[classes] data_top_classes = data_merged[data_merged["count"] > top_class].sort_values(by=["count"], ascending=False) data_top_classes = data_top_classes.reset_index(drop=True) #Print results data_top_classes.groupby(["classification"])["classification"] \ .count() \ .sort_values(ascending = False)  ## classification ## HYDROLASE 23465 ## TRANSFERASE 16557 ## OXIDOREDUCTASE 14288 ## IMMUNE SYSTEM 7920 ## HYDROLASE/HYDROLASE INHIBITOR 5661 ## LYASE 4476 ## TRANSCRIPTION 4288 ## TRANSPORT PROTEIN 3751 ## SIGNALING PROTEIN 3284 ## VIRAL PROTEIN 2761 ## Name: classification, dtype: int64  ## Transformation To build the models we binarize the classes. See: link for an explanation on choosing the right encoding method. lb = LabelBinarizer() y = lb.fit_transform(data_top_classes["classification"]) y[1:10]  ## array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0], ## [1, 0, 0, 0, 0, 0, 0, 0, 0, 0], ## [1, 0, 0, 0, 0, 0, 0, 0, 0, 0], ## [1, 0, 0, 0, 0, 0, 0, 0, 0, 0], ## [1, 0, 0, 0, 0, 0, 0, 0, 0, 0], ## [1, 0, 0, 0, 0, 0, 0, 0, 0, 0], ## [1, 0, 0, 0, 0, 0, 0, 0, 0, 0], ## [1, 0, 0, 0, 0, 0, 0, 0, 0, 0], ## [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])  The sequences are transformed to a numpy array. X = data_top_classes["sequence"].to_numpy()  # Amino Acid composition model As a first most basic model we can use the Amino Acid (AA) composition of each sequence to train our model. To do this we count the number of AAs for each sequence. vectorizer = CountVectorizer(analyzer="char") X_vec = vectorizer.fit_transform(X) X_AAC = X_vec.toarray() feature_names = vectorizer.get_feature_names_out() pd.DataFrame(X_AAC, columns = feature_names)  ## a c d e f g h i k l m n p q r s t v \ ## 0 22 1 23 20 11 16 6 27 29 18 2 16 19 7 6 13 19 14 ## 1 19 4 20 15 8 18 11 17 18 24 8 9 13 4 11 18 17 10 ## 2 23 3 21 23 19 29 7 26 17 23 8 24 22 11 18 31 26 12 ## 3 23 3 21 22 19 29 7 26 17 23 8 24 22 12 18 31 26 12 ## 4 40 0 22 16 9 26 13 16 9 23 5 11 20 13 15 13 28 28 ## ... .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ## 86446 13 4 4 3 4 18 0 5 5 11 2 4 6 8 5 16 8 9 ## 86447 20 5 12 13 18 25 6 10 7 30 4 21 29 17 12 20 30 22 ## 86448 12 2 5 5 5 15 1 2 4 7 3 4 3 8 8 13 9 7 ## 86449 16 5 20 9 20 25 7 10 6 28 3 19 28 20 14 22 20 22 ## 86450 53 3 28 28 17 32 6 30 35 45 7 28 24 19 13 19 24 25 ## ## w y ## 0 6 11 ## 1 1 9 ## 2 11 26 ## 3 11 26 ## 4 4 12 ## ... .. .. ## 86446 2 8 ## 86447 5 9 ## 86448 3 8 ## 86449 4 10 ## 86450 10 17 ## ## [86451 rows x 20 columns]  ## AAC Model For the AA model I use three dense layers with the last layer having 10 units which is equal to the amount of classes that we try to predict. I use early stopping based on the accuracy of the test data to avoid over fitting. To limit the training time I increased the batch size to 128. The model is wrapped into a function such that I can re-use the model for Cross Validation. def Model_AAC(X_train, y_train, X_test, y_test, classes = 10): #Define model model = Sequential() model.add(Dense(16, input_dim=X_train.shape[1], activation="relu")) model.add(Dense(8, activation="relu")) model.add(Dense(classes, activation="softmax")) #Define optimizer opt = Adam(learning_rate=0.01) early_stopping = EarlyStopping(min_delta=0.01, patience=5, monitor='val_accuracy', verbose=0, mode="max", restore_best_weights=True ) #Compile model model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy']) model.fit(x=X_train, y=y_train, validation_data=(X_test, y_test), epochs=100, callbacks=[early_stopping], verbose=0, batch_size=128) y_pred = model.predict(X_test) test_acc = accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_pred, axis=1)) return test_acc  ## Valdidation To test the model we check its performance on the test data that we set apart and using 5-fold cross validation. ### Training data To asses the model the data is split into a train set and 20% of the data is held back to evaluate our model. To avoid bias I normalized the data using the MinMaxScaler() (see: link for an explanation on normalization of data). #Generate train and test data X_train, X_test, y_train, y_test = train_test_split(X_AAC, y, test_size=.2, random_state=42, shuffle=True) #Scale data scaler = MinMaxScaler() X_train = scaler.fit_transform(pd.DataFrame(X_train, columns = feature_names)) #Fit and transform data on training data X_test = scaler.transform(pd.DataFrame(X_test, columns = feature_names)) #Transform data using the training parameters for fitting. #Fit data to Model train_acc = Model_AAC(X_train, y_train, X_test, y_test) print("Accuracy-train = {:.3f}".format(train_acc))  ## Accuracy-train = 0.452  ### Cross Validation Estimate accuracy using 5-fold cross validation. #Generate X, y data ACC = [] kf = KFold(n_splits=5, random_state=42, shuffle=True) i = 1 for train_index, test_index in kf.split(X_AAC): X_train, X_test = X_AAC[train_index], X_AAC[test_index] y_train, y_test = y[train_index], y[test_index] scaler = MinMaxScaler() X_train = scaler.fit_transform(pd.DataFrame(X_train)) X_test = scaler.transform(pd.DataFrame(X_test)) pred = Model_AAC(X_train, y_train, X_test, y_test) print("Accuracy-fold-{} = {:.3f}".format(i, pred)) i = i+1 ACC += [pred]  ## Accuracy-fold-1 = 0.435 ## Accuracy-fold-2 = 0.448 ## Accuracy-fold-3 = 0.428 ## Accuracy-fold-4 = 0.443 ## Accuracy-fold-5 = 0.429  print("Accuracy-mean = {:.3f}".format(np.mean(ACC, dtype=object)))  ## Accuracy-mean = 0.437  ## Conclusion Using the AA composition alone we get an accuracy of around 0.4. This is not too bad considering we are trying to predict 10 different classes. This type of encoding can work well for shorter sequences like peptides. However, with this type of encoding any information on the order of the sequences is lost. Especially for longer sequences this is important as many classes are defined by specific domains within the sequence. # N-grams model To preserve some information on the order of AAs in the sequences we can count the occurrences of the most common AA combinations across all the sequences. In the below example we look at the 20 most common 3 letter AA combinations and then count the occurrence of each for all the sequences. vectorizer = CountVectorizer(analyzer="char", ngram_range=(3,3), max_features = 20) X_train_vec = vectorizer.fit_transform(X) feature_names = vectorizer.get_feature_names_out() #Print results pd.DataFrame(X_train_vec.toarray(), columns = feature_names)  ## aaa aag aal ala ale all avl eal ell gla gsg gss hhh laa \ ## 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 ## 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 ## 2 0 0 0 0 0 0 0 0 0 0 1 0 0 0 ## 3 0 0 0 0 0 0 0 0 0 0 1 0 0 0 ## 4 0 0 1 0 0 0 0 0 0 0 0 0 4 0 ## ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ## 86446 0 0 0 0 0 0 0 0 0 0 1 0 0 0 ## 86447 0 0 0 0 0 0 0 0 0 1 0 0 0 0 ## 86448 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ## 86449 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ## 86450 2 0 2 1 0 0 0 1 0 1 0 0 0 1 ## ## lke lla lle lll sgs ssg ## 0 0 0 0 0 1 0 ## 1 0 0 0 0 0 0 ## 2 0 0 0 0 0 0 ## 3 0 0 0 0 0 0 ## 4 0 1 0 0 0 0 ## ... ... ... ... ... ... ... ## 86446 0 0 0 0 1 0 ## 86447 0 0 0 0 0 1 ## 86448 0 0 0 0 0 0 ## 86449 0 0 0 0 0 0 ## 86450 0 1 0 0 0 0 ## ## [86451 rows x 20 columns]  We use the above principle to count the occurrence of AA combinations with a length of 1 to 3 AAs and use the top 8000 features as input for our model. Note that we split the data in a train and test set first. We then generate all the N-grams based on the training data. This makes sure that we don’t have data leakage when we estimate the accuracy on the test data. The occurrences are then counted for both the test and train data. As the final step we scale the values (see the AA composition model). #Generate train and test data X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42, shuffle=True) #Generate N-grams and fit on train and test data vectorizer = CountVectorizer(analyzer="char", ngram_range=(1,3), max_features = 8000) X_train_vec = vectorizer.fit_transform(X_train) X_test_vec = vectorizer.transform(X_test) feature_names = vectorizer.get_feature_names_out() #Scale values scaler = MinMaxScaler() X_train_AAN_scaled = scaler.fit_transform(pd.DataFrame(X_train_vec.toarray(), columns = feature_names)) X_test_AAN_scaled = scaler.transform(pd.DataFrame(X_test_vec.toarray(), columns = feature_names)) #Use X_train parameters for fitting. #Print results pd.set_option('display.max_columns', 10) pd.DataFrame(X_train_AAN_scaled, columns = feature_names)  ## a aa aaa aac aad ... yys yyt yyv yyw \ ## 0 0.036585 0.000000 0.000000 0.0 0.0 ... 0.0 0.0 0.000000 0.0 ## 1 0.004878 0.000000 0.000000 0.0 0.0 ... 0.0 0.5 0.000000 0.0 ## 2 0.029268 0.013245 0.000000 0.0 0.2 ... 0.0 0.0 0.000000 0.0 ## 3 0.151220 0.119205 0.040000 0.0 0.0 ... 0.0 0.0 0.000000 0.0 ## 4 0.007317 0.000000 0.000000 0.0 0.0 ... 0.0 0.0 0.000000 0.0 ## ... ... ... ... ... ... ... ... ... ... ... ## 69155 0.080488 0.033113 0.006667 0.0 0.0 ... 0.0 0.5 0.000000 0.0 ## 69156 0.007317 0.000000 0.000000 0.0 0.0 ... 0.0 0.0 0.333333 0.0 ## 69157 0.119512 0.039735 0.000000 0.0 0.2 ... 0.0 0.0 0.000000 0.0 ## 69158 0.007317 0.000000 0.000000 0.0 0.0 ... 0.0 0.0 0.000000 0.0 ## 69159 0.078049 0.013245 0.000000 0.0 0.0 ... 0.0 0.0 0.000000 0.0 ## ## yyy ## 0 0.0 ## 1 0.0 ## 2 0.0 ## 3 0.0 ## 4 0.0 ## ... ... ## 69155 0.0 ## 69156 0.0 ## 69157 0.0 ## 69158 0.0 ## 69159 0.0 ## ## [69160 rows x 8000 columns]  ## Model For comparison we use the same set-up as the Amino Acid composition model. However, more layers and nodes will most likely be able to improve the accuracy since we have far more features. def Model_AAN(X_train, y_train, X_test, y_test, classes = 10): #Define model model = Sequential() model.add(Dense(16, input_dim=X_train.shape[1], activation="relu")) model.add(Dense(8, activation="relu")) model.add(Dense(classes, activation="softmax")) #Define optimizer opt = Adam(learning_rate=0.01) early_stopping = EarlyStopping(min_delta=0.01, patience=5, monitor='val_accuracy', verbose=0, mode="max", restore_best_weights=True ) #Compile model model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy']) model.fit(x=X_train, y=y_train, validation_data=(X_test, y_test), epochs=100, callbacks=[early_stopping], verbose=0, batch_size=128) y_pred = model.predict(X_test) test_acc = accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_pred, axis=1)) return test_acc  ## Validation ### Training data The same as with the AA composition model the values are scaled and fit to the model #Fit data to Model train_acc = Model_AAN(X_train_AAN_scaled, y_train, X_test_AAN_scaled, y_test) print("Accuracy-train = {:.3f}".format(train_acc))  ## Accuracy-train = 0.806  ### Cross Validation #Generate X, y data ACC = [] kf = KFold(n_splits=5, random_state=42, shuffle=True) i = 1 for train_index, test_index in kf.split(X): X_train, X_test = X[train_index], X[test_index] y_train, y_test = y[train_index], y[test_index] #Generate N-grams vectorizer = CountVectorizer(analyzer="char", ngram_range=(1,3), max_features = 8000) X_train_vec = vectorizer.fit_transform(X_train) X_test_vec = vectorizer.transform(X_test) feature_names = vectorizer.get_feature_names_out() #Scale values scaler = MinMaxScaler() X_train_scaled = scaler.fit_transform(pd.DataFrame(X_train_vec.toarray(), columns = feature_names)) X_test_scaled = scaler.transform(pd.DataFrame(X_test_vec.toarray(), columns = feature_names)) #Use X_train parameters for fitting. pred = Model_AAN(X_train_scaled, y_train, X_test_scaled, y_test) print("Accuracy-fold-{} = {:.3f}".format(i, pred)) i = i+1 ACC += [pred]  ## Accuracy-fold-1 = 0.807 ## Accuracy-fold-2 = 0.802 ## Accuracy-fold-3 = 0.811 ## Accuracy-fold-4 = 0.794 ## Accuracy-fold-5 = 0.804  print("Accuracy-mean = {:.3f}".format(np.mean(ACC)))  ## Accuracy-mean = 0.804  ## Conclusion Using N-grams we get an accuracy of around 0.8. This is a very big improvement over using only the AA composition. The increase in accuracy is mostly because with this kind of encoding we are capturing the AA motifs that are most common among the sequences. We might be able to increase the accuracy even further by playing around with the length of the words and increasing the complexity of the model by adding more layers and nodes. # Embedding model For this encoding the sequences are first split into AAs and each AA is assigned a number (tokenized). #Tokenize sequences tk = Tokenizer(char_level=True) tk.fit_on_texts(X) seq_tok_all = tk.texts_to_sequences(X) seq_tok_all[1]  ## [18, 15, 10, 6, 1, 4, 12, 20, 1, 9, 7, 13, 16, 2, 16, 10, 1, 17, 7, 6, 7, 8, 3, 8, 4, 3, 4, 4, 7, 12, 5, 6, 2, 6, 12, 10, 10, 7, 5, 1, 9, 11, 5, 3, 11, 13, 1, 8, 16, 10, 1, 13, 8, 17, 17, 17, 16, 7, 17, 8, 3, 3, 13, 1, 6, 1, 9, 7, 11, 16, 3, 2, 9, 4, 10, 3, 5, 2, 18, 7, 9, 7, 11, 10, 12, 3, 10, 7, 18, 2, 1, 9, 7, 3, 7, 9, 19, 18, 14, 2, 3, 17, 6, 4, 17, 4, 18, 7, 8, 12, 3, 17, 8, 9, 3, 17, 10, 5, 1, 16, 14, 12, 3, 5, 11, 2, 10, 14, 8, 3, 7, 8, 18, 14, 5, 1, 5, 20, 3, 9, 1, 14, 6, 3, 8, 12, 9, 15, 18, 1, 2, 5, 1, 15, 9, 10, 8, 5, 1, 12, 7, 7, 8, 5, 10, 16, 20, 3, 17, 6, 16, 8, 1, 5, 13, 5, 9, 14, 2, 1, 5, 1, 6, 12, 13, 13, 6, 4, 1, 15, 5, 16, 2, 2, 17, 4, 2, 6, 1, 11, 5, 9, 9, 1, 12, 8, 10, 12, 8, 8, 4, 9, 18, 6, 9, 2, 20, 13, 12, 14, 1, 11, 5, 5, 13, 8, 7, 10, 11, 11, 2, 1, 11, 10, 12, 6, 2, 2, 7, 6, 2, 6, 2, 1, 3, 10, 10, 11, 9, 2, 9, 7, 7, 14]  The protein sequences have variable lengths while for deep learning models equal size sequences are required. For this reason shorter sequences need to be padded. The longest sequence in the data is around 5000 AAs long therefore by default all other sequences will be padded to the same length. This will result in a very large data set that will take very long to train. For this reason I decided to add a maximum length of 256 AAs. Sequences shorther then 256 AAs will be padded and sequences longer then 256 AAs will be trimmed. max_length = 256 #max_length = len(max(X, key=len)) X_vec = pad_sequences(seq_tok_all, maxlen=max_length, padding='post') #padding='post'  ## Model For this model a Convolutional kernel (Conv1D) is used as this type seemed to work better then a dense layer with this type of sequence representation. Before the data is fed into the model an Embedding layer is used which which transforms the tokenized sequences into a continues numeric vector of length 8. def Model_embed(X_train, y_train, X_test, y_test, max_length=max_length): #Define model model = Sequential() model.add(Embedding(input_dim=len(tk.word_index)+1, output_dim=8, input_length=max_length, trainable=True)) model.add(Conv1D(filters=64, kernel_size=6, padding="same", activation="relu")) model.add(Flatten()) model.add(Dense(10, activation="softmax")) opt = Adam(learning_rate=0.01) early_stopping = EarlyStopping(min_delta=0.01, patience=5, monitor='val_accuracy', verbose=0, mode="max", restore_best_weights=True ) #Compile model model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy']) model.fit(x=X_train, y=y_train, validation_data=(X_test, y_test), epochs=100, callbacks=[early_stopping], verbose=0, batch_size=128) y_pred = model.predict(X_test) test_acc = accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_pred, axis=1)) return test_acc  ### Training data X_train, X_test, y_train, y_test = train_test_split(X_vec, y, test_size=.2, random_state=42, shuffle=True) train_acc_embed = Model_embed(X_train, y_train, X_test, y_test, max_length=max_length) print("Accuracy-train = {:.3f}".format(train_acc_embed))  ## Accuracy-train = 0.762  ### Cross Validation #Generate X, y data ACC = [] kf = KFold(n_splits=5, random_state=42, shuffle=True) i = 1 for train_index, test_index in kf.split(X_vec): X_train, X_test = X_vec[train_index], X_vec[test_index] y_train, y_test = y[train_index], y[test_index] pred = Model_embed(X_train, y_train, X_test, y_test, max_length=max_length) print("Accuracy-fold-{} = {:.3f}".format(i, pred)) i = i+1 ACC += [pred]  ## Accuracy-fold-1 = 0.749 ## Accuracy-fold-2 = 0.756 ## Accuracy-fold-3 = 0.757 ## Accuracy-fold-4 = 0.764 ## Accuracy-fold-5 = 0.758  print("Accuracy-mean = {:.3f}".format(np.mean(ACC)))  ## Accuracy-mean = 0.757  ## Conclusion The embedded model reaches an accuracy of around 0.75. This is much better then the AA composition model but still lower then the N-grams model. This is most likely because we loose some of the sequence information due to the padding and trimming that is required to get equal length vectors for our deep learning model. # BLOSUM encoding BLOSUM matrices are widely used in bioinformatics to generate sequence alignments. We can use this encoding also as an input for deep learning. BLOSUM encoding of the sequences was performed using the protr R package using the below code. library(protr) #Function for BLOSUM encoding of the AA sequences. BLOSUM_df <- function(sequences, submat="AABLOSUM62", k=5, lag=7, silent=TRUE){ df <- t(sapply(sequences, function(i) protr::extractBLOSUM(i, submat = submat, k=k, lag=lag, silent=silent))) df <- data.frame(df) rownames(df) <- c() return(df) } #Import data sequences <- pydata_top_classes["sequence"]
X_BLOSUM <-
BLOSUM_df(sequences\$sequence, submat="AABLOSUM62", k=9, lag=18, silent=TRUE) #9 / 18
#write_csv(X_BLOSUM, "BLOSUM_k9_lag19.csv")


BLOSUM sequence matrices generated in R were saved into a .csv file and loaded back into python

BLOSUM_seq_matrixes = pd.read_csv("Data/BLOSUM_k9_lag19.csv")
pd.set_option('display.max_columns', 6)
BLOSUM_seq_matrixes

##        scl1.lag1  scl2.lag1  scl3.lag1  ...  scl8.7.lag18  scl9.7.lag18  \
## 0      -0.002615   0.000283  -0.002937  ...     -0.003019      0.000230
## 1      -0.000948   0.000633   0.003090  ...      0.005362     -0.001396
## 2       0.005699   0.004398   0.003010  ...     -0.002283     -0.003664
## 3       0.005689   0.004419   0.002920  ...     -0.002295     -0.003665
## 4       0.002448   0.008009   0.000728  ...      0.000387      0.002231
## ...          ...        ...        ...  ...           ...           ...
## 86446  -0.004528   0.012028   0.008579  ...     -0.002316     -0.006078
## 86447  -0.002841   0.004932  -0.007188  ...      0.000123      0.001226
## 86448   0.004192   0.007186   0.023998  ...     -0.008741     -0.006935
## 86449  -0.002852   0.005154  -0.003875  ...     -0.000257      0.000208
## 86450  -0.001217   0.003631  -0.001252  ...     -0.000895      0.000950
##
##        scl9.8.lag18
## 0          0.000997
## 1          0.001682
## 2          0.003491
## 3          0.003506
## 4          0.004065
## ...             ...
## 86446      0.000662
## 86447      0.002007
## 86448      0.002186
## 86449      0.002532
## 86450      0.000996
##
## [86451 rows x 1458 columns]

X_BLOSUM = BLOSUM_seq_matrixes.to_numpy()


## Model

def Model_BLOSUM(X_train, y_train, X_test, y_test, classes = 10):
#Define model
model = Sequential()
#Define optimizer

early_stopping = EarlyStopping(min_delta=0.01,
patience=5,
monitor='val_accuracy',
verbose=0,
mode="max",
restore_best_weights=True
)
#Compile model
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

model.fit(x=X_train, y=y_train,
validation_data=(X_test, y_test),
epochs=100,
callbacks=[early_stopping],
verbose=0,
batch_size=128)

y_pred = model.predict(X_test)
test_acc = accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_pred, axis=1))
return test_acc


## Validation

### Training data

X_train, X_test, y_train, y_test = train_test_split(X_BLOSUM, y, test_size=.2, random_state=42, shuffle=True)

train_acc_BLOSUM = Model_BLOSUM(X_train, y_train, X_test, y_test)

print("Accuracy-train = {:.3f}".format(train_acc_BLOSUM))

## Accuracy-train = 0.753


### Cross Validation

ACC = []
kf = KFold(n_splits=5, random_state=42, shuffle=True)
i = 1
for train_index, test_index in kf.split(X_BLOSUM):
X_train, X_test = X_BLOSUM[train_index], X_BLOSUM[test_index]
y_train, y_test = y[train_index], y[test_index]

pred = Model_BLOSUM(X_train, y_train, X_test, y_test)
print("Accuracy-fold-{} = {:.3f}".format(i, pred))
i = i+1

ACC += [pred]

## Accuracy-fold-1 = 0.777
## Accuracy-fold-2 = 0.760
## Accuracy-fold-3 = 0.776
## Accuracy-fold-4 = 0.757
## Accuracy-fold-5 = 0.766

print("Accuracy-mean = {:.3f}".format(np.mean(ACC)))

## Accuracy-mean = 0.767


## Conclusion

The BLOSUM model reaches an accuracy of around 0.75. This is similar to the N-grams model. However, generating the BLOSUM matrixes took much langer then generating the N-grams. Because there are so many features generated a higher accuracy might be reached by increasing layers and nodes.