TensorFlow: string input data for an autoencoder

I need to develop an autoencoder using TensorFlow. In the documentation and tutorials I can see many examples with image data and MNIST_data, which is pre-processed numerical data.

In my case, however, the data is in text format:


 uid       orig_h       orig_p   trans_depth      method       host
======================================================================
5fg288      80       1               POST
2fg888      80       2               GET

How can I convert this data to a numerical format that TensorFlow accepts? I couldn't find any example in the TensorFlow tutorials.

I am a beginner in TensorFlow; please help.


Based on the instructions below, I have created a word-to-vector mapping by following the tutorial here

The input as a pandas DataFrame:

   host       method   orig_h        orig_p      trans_depth     uid
0    POST      80            1          5fg288
1   GET     443            2          2fg888


Bag of words ---> ['5fg288', '2fg888', '80', 'GET', '443', '1', '', '', '', '2', '', 'POST']

Now, for each cell, I have an array of values like:

           ---> [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
           ---> [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
80         ----> [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

So how can I reshape this data to give it to TensorFlow?

Should it be like this:

data = array([ [[0.0,...],[0.0,...],[0.0,...],[0.0,...],[0.0,...],[0.0,...]], [[0.0,...],[0.0,...],[0.0,...],[0.0,...],[0.0,...],[0.0,...]] ]) 

That is, each feature is an array of floats, and there are 6 features in a single sample. Is that possible?

1 Answer

TensorFlow accepts data as NumPy arrays. A pandas DataFrame can be converted to NumPy with df.to_numpy() (the older df.as_matrix() was removed in pandas 1.0). But the crux of your question is how to convert these various data types into continuous numeric representations for a neural network (or any machine learning method).
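As a minimal sketch of that conversion, assuming a pandas version that provides DataFrame.to_numpy() and a toy frame holding only the two numeric-looking columns from the question (the values follow this answer's reading of the sample rows):

```python
import numpy as np
import pandas as pd

# Toy frame; the values are taken from the question's sample rows.
df = pd.DataFrame({"orig_p": [80, 443], "trans_depth": [1, 2]})

# Convert to a float32 NumPy array, one row per sample.
features = df.to_numpy(dtype=np.float32)

print(features.shape)  # (2, 2): two samples, two columns
```

This only changes the container. Raw port numbers are still category codes rather than magnitudes, which is why the encoding step matters.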

The answer linked below provides some helpful references to the scikit-learn documentation, which discusses the details; they are too numerous to rewrite here:

Machine learning with multiple feature types in python

Some of your data will translate easily after reading that guide, such as trans_depth, orig_p, and method, which appear to be categorical data. In cases like this, you convert each column into multiple features with values in {1,0} that indicate whether a given class is present. For example, orig_p might be represented as two features, x1 and x2: x1=1 if orig_p=80 and 0 otherwise, and x2=1 if orig_p=443 and 0 otherwise.
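The x1/x2 example can be sketched with pandas.get_dummies. This is one way to do it, not the only one; the third row is a hypothetical extra sample added so that a value repeats.

```python
import pandas as pd

# Port values as read from the sample; the last entry is a made-up extra row.
orig_p = pd.Series([80, 443, 80], name="orig_p")

# One indicator column per distinct value (one-hot encoding).
indicators = pd.get_dummies(orig_p, prefix="orig_p", dtype=float)

x1 = indicators["orig_p_80"]   # 1.0 where orig_p == 80, else 0.0
x2 = indicators["orig_p_443"]  # 1.0 where orig_p == 443, else 0.0

print(indicators)
```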

You might do the same with host, but think about how, and whether, you really want to use it. For example, if you consider it important, you could define a categorical feature that identifies only the top-level domain (.com, .edu, .org, etc.), because individual hostnames might be too numerous to represent.
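A rough sketch of the domain idea follows. The hostnames are hypothetical, since the sample rows in the question show no host values, and taking the text after the last dot is a simplification that ignores multi-part suffixes such as .co.uk.

```python
import pandas as pd

# Hypothetical hostnames for illustration only.
hosts = pd.Series(
    ["example.com", "cs.example.edu", "example.org", "mirror.example.com"]
)

# Keep only the text after the last dot as a coarse category.
tld = hosts.str.rsplit(".", n=1).str[-1]

# One indicator column per top-level domain.
tld_features = pd.get_dummies(tld, prefix="tld", dtype=float)

print(tld_features)
```

This keeps the number of features equal to the number of distinct suffixes rather than the number of distinct hostnames.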

You might also consider clustering hostnames into categories of hosts based on some database (if such a thing exists), and using the cluster to which a hostname belongs as a categorical feature.

For orig_h, you might consider grouping IPs by region and defining a categorical feature per region.
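A sketch of the grouping step, using only the standard-library ipaddress module. The prefix-to-region table and the region labels are entirely hypothetical (the prefixes are reserved documentation ranges); a real table would have to come from a GeoIP or registry database.

```python
import ipaddress

# Hypothetical lookup table: network prefix -> region label.
REGION_BY_NETWORK = {
    ipaddress.ip_network("192.0.2.0/24"): "region_a",
    ipaddress.ip_network("198.51.100.0/24"): "region_b",
}

def region_of(ip_text):
    """Return the region label for an IP string, or 'unknown' if no prefix matches."""
    ip = ipaddress.ip_address(ip_text)
    for network, region in REGION_BY_NETWORK.items():
        if ip in network:
            return region
    return "unknown"

regions = [region_of(ip) for ip in ["192.0.2.10", "198.51.100.7", "203.0.113.5"]]
print(regions)
```

The resulting labels form an ordinary categorical column and can be one-hot encoded like the others.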

uid looks to be unique per user, so you might not use that column of data.
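Putting the pieces together for the columns that have values in the sample, here is a sketch that drops uid and builds the input array. It assumes the three remaining columns are all treated as categorical, and it uses one common layout: the per-column one-hot vectors are concatenated into a single flat row per sample, rather than nested as a 3-D array.

```python
import numpy as np
import pandas as pd

# Sample rows, following this answer's reading of the columns.
df = pd.DataFrame({
    "uid": ["5fg288", "2fg888"],
    "method": ["POST", "GET"],
    "orig_p": [80, 443],
    "trans_depth": [1, 2],
})

categorical = ["method", "orig_p", "trans_depth"]

# Drop the identifier, then one-hot encode the remaining columns.
encoded = pd.get_dummies(df.drop(columns=["uid"]), columns=categorical, dtype=float)

# One flat row per sample: shape (n_samples, n_features).
x = encoded.to_numpy(dtype=np.float32)

print(x.shape)
```

A 2-D array of this shape is what a fully connected autoencoder would typically take as both input and reconstruction target.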

You will need to think this through for each field. Start with the documentation I linked to. In general, though, this is a question of standard data mining, and any good book on data mining will be invaluable for understanding these concepts further. Here's an easy one to find online via a Google search:

I will also include the following reference because it provides the best tutorials I've seen, hands down, and its introduction to ML section has a set of articles that will be very useful to read. It's slightly tangential to the question, but I expect it will be useful.

