I have the following data:

        feat_1   feat_2   ...   feat_n   label
gene_1  100.33   10.2     ...   90.23    great
gene_2   13.32   87.9     ...   77.18    soso
...
gene_m  213.32   63.2     ...   12.23    quitegood
The number of rows M is large (~30K) and the number of feature columns N is much smaller (~10). My question is: what is the appropriate deep learning architecture to learn from and test on data like the above?
At the end of the day, the user will provide a vector of genes with their expression values:
gene_1   989.00
gene_2    77.10
...
gene_N   100.10
And the system should assign a label to each gene, e.g. great or soso, etc.
By structure I mean one of these:
- Convolutional Neural Network (CNN)
- Autoencoder
- Deep Belief Network (DBN)
- Restricted Boltzmann Machine (RBM)
3 Answers
Answer 1
To expand a little on @sung-kim's comment:
- CNNs are used primarily for computer vision problems, such as classifying images. They are modelled on the animal visual cortex: the connection network is arranged in overlapping tiles of features. They typically require a lot of data, more than 30k examples.
- Autoencoders are used for feature generation and dimensionality reduction. They start with many neurons in each layer, reduce that number towards the middle, and then increase it again, and each object is trained to reconstruct itself. As a result, the middle layers (with few neurons) provide a meaningful projection of the feature space into a low dimension (a minimal sketch follows this list).
- While I don't know much about DBNs, they appear to be a supervised extension of the autoencoder, with lots of parameters to train.
- Again, I don't know much about Boltzmann machines, but to my knowledge they aren't widely used for this sort of problem.
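To illustrate the autoencoder bullet above, here is a minimal Keras sketch of the bottleneck idea, assuming ~10 feature columns as in the question; the layer sizes and random data are placeholders, not recommendations:

import numpy as np
from keras.models import Model
from keras.layers import Input, Dense

nFeatures = 10                                    # matches the ~10 columns in the question
X = np.random.random((1000, nFeatures))           # placeholder for the real expression matrix

inputs = Input(shape=(nFeatures,))
encoded = Dense(6, activation='relu')(inputs)             # compress
bottleneck = Dense(3, activation='relu')(encoded)         # low-dimensional projection
decoded = Dense(6, activation='relu')(bottleneck)         # expand again
outputs = Dense(nFeatures, activation='linear')(decoded)  # reconstruct the input

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)  # each object is trained on itself

encoder = Model(inputs, bottleneck)   # the encoder half gives the low-dimensional features
lowDim = encoder.predict(X)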
As with all modelling problems, though, I would suggest starting with the most basic model and looking for signal. A good place to start is logistic regression, before you worry about deep learning.
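Here is a minimal sketch of that baseline with scikit-learn, assuming a 30k x 10 feature matrix and string labels as in the question; the random data below is only a stand-in for the real expression values:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X = rng.random_sample((30000, 10))                            # feat_1 ... feat_n
y = rng.choice(['great', 'soso', 'quitegood'], size=30000)    # label column

XTrain, XTest, yTrain, yTest = train_test_split(X, y, test_size=0.3, random_state=42)

clf = LogisticRegression(max_iter=1000)   # handles multi-class labels directly
clf.fit(XTrain, yTrain)
print('baseline accuracy: {0:.2f}'.format(clf.score(XTest, yTest)))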
If you have got to the point where you want to try deep learning, for whatever reason, then for this type of data a basic feed-forward network is the best place to start. In terms of deep learning, 30k data points is not a large number, so it is always best to start out with a small network (1-3 hidden layers, 5-10 neurons) and then grow it. Make sure you have a decent validation set when performing parameter optimisation, though. If you're a fan of the scikit-learn API, I suggest Keras as a good place to start.
One further comment: you will want to use a OneHotEncoder on your class labels before you do any training.
EDIT
I see from the bounty and the comments that you want to see a bit more about how these networks work. Please see the example below of how to build a feed-forward model and do some simple parameter optimisation:
import numpy as np
from sklearn import preprocessing
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

# Create some random data
np.random.seed(42)
X = np.random.random((10, 50))

# Similar labels
labels = ['good', 'bad', 'soso', 'amazeballs', 'good']
labels += labels
labels = np.array(labels)
np.random.shuffle(labels)

# Change the labels to the required one-hot format
numericalLabels = preprocessing.LabelEncoder().fit_transform(labels)
numericalLabels = numericalLabels.reshape(-1, 1)
y = preprocessing.OneHotEncoder().fit_transform(numericalLabels).toarray()  # dense one-hot matrix

# Simple Keras model builder
def buildModel(nFeatures, nClasses, nLayers=3, nNeurons=10, dropout=0.2):
    model = Sequential()
    model.add(Dense(nNeurons, input_dim=nFeatures))
    model.add(Activation('sigmoid'))
    model.add(Dropout(dropout))

    for i in range(nLayers - 1):
        model.add(Dense(nNeurons))
        model.add(Activation('sigmoid'))
        model.add(Dropout(dropout))

    model.add(Dense(nClasses))
    model.add(Activation('softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='sgd')
    return model

# Do an exhaustive search over a given parameter space
for nLayers in range(2, 4):
    for nNeurons in range(5, 8):
        model = buildModel(X.shape[1], y.shape[1], nLayers, nNeurons)
        modelHist = model.fit(X, y, batch_size=32, epochs=10,
                              validation_split=0.3, shuffle=True, verbose=0)
        minLoss = min(modelHist.history['val_loss'])
        epochNum = modelHist.history['val_loss'].index(minLoss)
        print('{0} layers, {1} neurons best validation at'.format(nLayers, nNeurons),
              'epoch {0} loss = {1:.2f}'.format(epochNum, minLoss))
Which outputs:
2 layers, 5 neurons best validation at epoch 0 loss = 1.18
2 layers, 6 neurons best validation at epoch 0 loss = 1.21
2 layers, 7 neurons best validation at epoch 8 loss = 1.49
3 layers, 5 neurons best validation at epoch 9 loss = 1.83
3 layers, 6 neurons best validation at epoch 9 loss = 1.91
3 layers, 7 neurons best validation at epoch 9 loss = 1.65
Answer 2
A deep learning architecture would be recommended if you were dealing with raw data and wanted to find features that work towards your classification goal automatically. But based on the names of your columns and their number (only 10), it seems that your features are already engineered.
For this reason you could just go with a standard multi-layer neural network and use supervised learning (back-propagation). Such a network would have a number of inputs matching the number of your columns (10), followed by a number of hidden layers, and then an output layer with a number of neurons matching the number of your labels. You could experiment with different numbers of hidden layers and neurons, different neuron types (sigmoid, tanh, rectified linear, etc.), and so on; a minimal sketch is given below.
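A minimal Keras sketch of the network described above, with 10 inputs and one output neuron per label; the hidden-layer sizes and activations are assumptions to experiment with, not recommendations:

from keras.models import Sequential
from keras.layers import Dense

nInputs = 10    # one input per feature column
nLabels = 3     # e.g. great / soso / quitegood (adjust to your label set)

model = Sequential()
model.add(Dense(16, activation='tanh', input_dim=nInputs))   # first hidden layer
model.add(Dense(16, activation='relu'))                      # second hidden layer
model.add(Dense(nLabels, activation='softmax'))              # one output neuron per label
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])

# model.fit(X, yOneHot, epochs=20, validation_split=0.3)   # X: m x 10, yOneHot: one-hot labels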
Alternatively, you could use the raw data (if it's available) and then go with DBNs (they're known to be robust and to achieve good results across different problems) or autoencoders.
Answer 3
If you expect the output to be thought of as scores for a label (as I understood from your question), try a supervised multi-class logistic regression classifier (the highest score takes the label).
If you're bound to use deep learning, a simple feed-forward ANN should do, with supervised learning through back-propagation: an input layer with N neurons plus one or two hidden layers, and not more than that. There is no need to go 'deep' and add more layers for this data; with more layers you risk overfitting easily, it can be tricky to figure out what the problem is if you do, and test accuracy will suffer greatly.
Simply plotting or visualising the data, e.g. with t-SNE, can be a good start if you need to figure out which features are important (or any correlation that may exist); a minimal sketch follows.
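Here is a minimal sketch of such a visualisation with scikit-learn's t-SNE and matplotlib, assuming the m x 10 expression matrix and string labels from the question; the random data below is only a placeholder:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
X = rng.random_sample((2000, 10))                              # a subsample of the ~30k genes
labels = rng.choice(['great', 'soso', 'quitegood'], size=2000)

X2d = TSNE(n_components=2, random_state=0).fit_transform(X)    # project 10 features to 2-D

for lab in np.unique(labels):
    mask = labels == lab
    plt.scatter(X2d[mask, 0], X2d[mask, 1], s=5, label=lab)
plt.legend()
plt.title('t-SNE projection of the gene feature space')
plt.show()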
You can then play with higher powers of those feature dimensions, or add increased weight to their scores.
For problems like this, deep learning probably isn't very well suited, but a simpler ANN architecture like this should work well, depending on the data.