I need to develop an autoencoder using TensorFlow. When I check the documentation and tutorials I see many examples with image data and MNIST_data, which is pre-processed numerical data.
In my case the data is in text format, like:
uid     orig_h        orig_p  trans_depth  method  host
======================================================================
5fg288  192.168.1.4   80      1            POST    ex1.com
2fg888  192.168.1.3   80      2            GET     ex2.com
So how can I convert this data to a numerical format that TensorFlow accepts? I couldn't find any example of this in the TensorFlow tutorials.
I am a beginner with TensorFlow, please help.
Update
Based on the instructions in the answer below, I have created a word-to-vector mapping by referring to the tutorial here.
The input is in a pandas DataFrame:
   host     method  orig_h       orig_p  trans_depth  uid
0  ex1.com  POST    192.168.1.4  80      1            5fg288
1  ex2.com  GET     192.168.1.3  443     2            2fg888
And
Bag of words ---> ['5fg288', '2fg888', '80', 'GET', '443', '1', 'ex2.com', '192.168.1.4', '192.168.1.3', '2', 'ex1.com', 'POST']
Now for each cell I have an array of values, like:
192.168.1.4 ---> [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
ex1.com     ---> [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
80          ---> [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
So how can I reshape this data to feed it to TensorFlow?
Should it be like:
data = array([
    [[0.0,...], [0.0,...], [0.0,...], [0.0,...], [0.0,...], [0.0,...]],
    [[0.0,...], [0.0,...], [0.0,...], [0.0,...], [0.0,...], [0.0,...]]
])
That is, each feature is an array of floats, and there are 6 features in a single sample. Is that possible?
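In NumPy terms, I imagine something like the sketch below (the indices are only illustrative, taken from the bag of words above):

import numpy as np

# 2 samples, 6 features, each feature a 12-dimensional one-hot
# vector over the bag of words above.
vocab_size = 12
n_features = 6
data = np.zeros((2, n_features, vocab_size), dtype=np.float32)
# e.g. sample 0, feature orig_h == '192.168.1.4' (index 7 in the bag of words)
data[0, 2, 7] = 1.0
print(data.shape)  # (2, 6, 12)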
1 Answer
TensorFlow accepts data in NumPy format. A pandas DataFrame can be converted to a NumPy array with df.values (the older df.as_matrix() method does the same but has since been deprecated). But the crux of your question is how to convert these various data types into numeric representations for a neural network (or any machine learning method).
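For illustration, here is a minimal sketch (with made-up column values) of turning a DataFrame into a float array that TensorFlow can consume:

import numpy as np
import pandas as pd

# Hypothetical toy frame with two of the already-numeric columns
df = pd.DataFrame({"orig_p": [80, 443], "trans_depth": [1, 2]})

# .values exposes the frame as a NumPy array, which TensorFlow
# accepts directly as input.
X = df.values.astype(np.float32)
print(X.shape)  # (2, 2)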
The answer linked below provides some helpful references to the scikit-learn documentation, which discusses the details, too numerous to rewrite here:
Machine learning with multiple feature types in python
Some of your data will translate easily after reading that guide, such as trans_depth, orig_p, and method, which appear to be categorical data. In cases like this, you will convert them to multiple features of {1,0} values that represent whether that class is present or not. For example, orig_p might be represented as two features x1 and x2: x1=1 if orig_p=80, 0 otherwise, and x2=1 if orig_p=443, 0 otherwise.
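A quick sketch of that encoding using pandas' get_dummies (the toy frame below just mirrors the columns in your question):

import pandas as pd

# Minimal sketch of the {1,0} encoding described above, assuming a
# DataFrame with the question's categorical columns.
df = pd.DataFrame({"orig_p": [80, 443],
                   "method": ["POST", "GET"],
                   "trans_depth": [1, 2]})

# get_dummies expands each listed column into one indicator column
# per distinct value, e.g. orig_p_80 and orig_p_443.
encoded = pd.get_dummies(df, columns=["orig_p", "method", "trans_depth"])
print(encoded.columns.tolist())
# ['orig_p_80', 'orig_p_443', 'method_GET', 'method_POST',
#  'trans_depth_1', 'trans_depth_2']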
You might do the same with the host, but you might have to think out how and if you really want to use the host. For example, if you consider it important you could define a categorical feature that identifies .com, .edu, .org, etc. domains only, because individual hostnames might be too numerous to want to represent.
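For instance, one possible sketch (the hostnames here are just examples) that keeps only the top-level domain:

import pandas as pd

# Reduce each host to its top-level domain and one-hot encode that,
# instead of the full hostname.
hosts = pd.Series(["ex1.com", "ex2.com", "school.edu"])
tld = hosts.str.rsplit(".", n=1).str[-1]           # 'com', 'com', 'edu'
host_features = pd.get_dummies(tld, prefix="tld")
print(host_features.columns.tolist())              # ['tld_com', 'tld_edu']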
You might also consider clustering hostnames into categories of hosts based on some database (if such a thing exists), and use the cluster to which the hostname belongs as a categorical feature.
For orig_h you might consider grouping IPs by region and defining a categorical feature per region.
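One rough sketch of that idea, here simply bucketing addresses by their /24 prefix in place of a real region lookup:

import pandas as pd

# Bucket addresses by their /24 prefix and treat the prefix
# as a category.
ips = pd.Series(["192.168.1.4", "192.168.1.3", "10.0.0.7"])
prefix = ips.str.rsplit(".", n=1).str[0]           # '192.168.1', '192.168.1', '10.0.0'
ip_features = pd.get_dummies(prefix, prefix="net")
print(ip_features.columns.tolist())                # ['net_10.0.0', 'net_192.168.1']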
uid looks to be unique per user, so you might not use that column of data.
You will need to think this out per data point. Start with the documentation I linked to, but in general this is a question of standard data mining; any good book on data mining will be invaluable for understanding these concepts further. Here's an easy one to find online via a Google search:
I will also include the following reference because they provide the best tutorials I've seen, hands down, and their introduction-to-ML section has a set of articles that will be very useful to read. It's slightly tangential to the question, but I expect it will be useful.