Wednesday, April 18, 2018

TensorFlow string input data for an autoencoder


I need to build an autoencoder with TensorFlow. The documentation and tutorials I have checked offer many examples using image data and MNIST_data, which is pre-processed numerical data.

In my case, however, the data is in text format, like this:

 uid       orig_h        orig_p   trans_depth   method   host
 ==============================================================
 5fg288    192.168.1.4   80       1             POST     ex1.com
 2fg888    192.168.1.3   80       2             GET      ex2.com

So how can I convert this data to a numerical format that TensorFlow accepts? I couldn't find any example in the TensorFlow tutorials.

I am a beginner in TensorFlow; please help.

Update

Based on the suggestions below, I created a word-to-vector mapping by following a tutorial.

The input as a pandas DataFrame:

    host      method   orig_h        orig_p   trans_depth   uid
 0  ex1.com   POST     192.168.1.4   80       1             5fg288
 1  ex2.com   GET      192.168.1.3   443      2             2fg888

And the bag of words:

 ['5fg288', '2fg888', '80', 'GET', '443', '1', 'ex2.com', '192.168.1.4', '192.168.1.3', '2', 'ex1.com', 'POST']

Now, for each cell, I have an array of values like:

 192.168.1.4 ---> [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
 ex1.com     ---> [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
 80          ---> [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

So how should I reshape this data to feed it to TensorFlow? Should it be like this?

 data = array([
     [[0.0,...],[0.0,...],[0.0,...],[0.0,...],[0.0,...],[0.0,...]],
     [[0.0,...],[0.0,...],[0.0,...],[0.0,...],[0.0,...],[0.0,...]]
 ])

That is, each feature is an array of floats, and there are 6 features in a single sample. Is that possible?

1 Answer

TensorFlow accepts data as NumPy arrays. A pandas DataFrame can be converted to NumPy with df.as_matrix(); note that as_matrix() has since been deprecated and removed from pandas, so on current versions use df.values or df.to_numpy() instead. But the crux of your question is how to convert these various data types into continuous numeric representations for a neural network (or any other machine learning method).
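A minimal sketch of the DataFrame-to-NumPy step, assuming a pandas version that provides to_numpy() (older versions expose the same array through df.values). The two numeric columns are copied from the question's sample; everything else is illustrative only.

```python
import pandas as pd

# Two numeric columns taken from the question's sample rows.
df = pd.DataFrame({
    "orig_p": [80, 443],
    "trans_depth": [1, 2],
})

# Convert to a float32 NumPy array, the dtype neural-network code usually expects.
features = df.to_numpy(dtype="float32")

print(features.shape)
print(features)
```

This only works directly for columns that are already numeric; string columns such as method or host need to be encoded first.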

The answer linked below provides helpful references to the scikit-learn documentation, which covers the details that are too numerous to rewrite here:

Machine learning with multiple feature types in python

Some of your data will translate easily after reading that guide, such as trans_depth, orig_p, and method, which appear to be categorical. In cases like this, you convert each column into multiple features with values in {1,0} that indicate whether a given class is present. For example, orig_p might be represented as two features, x1 and x2: x1=1 if orig_p=80 and 0 otherwise; x2=1 if orig_p=443 and 0 otherwise.
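One way to sketch this {1,0} encoding is with pandas.get_dummies, which creates one indicator column per distinct value. This is an illustration under the assumption that all three columns are treated as categorical; the answer does not prescribe a particular library call.

```python
import pandas as pd

# Categorical columns from the question's sample DataFrame.
df = pd.DataFrame({
    "method": ["POST", "GET"],
    "orig_p": [80, 443],
    "trans_depth": [1, 2],
})

# One indicator column per distinct value in each listed column.
encoded = pd.get_dummies(
    df, columns=["method", "orig_p", "trans_depth"], dtype=float
)

# Flat numeric matrix: one row per sample, one column per indicator feature.
x = encoded.to_numpy(dtype="float32")

print(list(encoded.columns))
print(x)
```

Each sample becomes a single flat vector of indicator features, which is the shape a plain fully connected autoencoder expects as input.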

You might do the same with host, but think about whether and how you really want to use it. For example, if you consider it important, you could define a categorical feature that identifies only the top-level domain (.com, .edu, .org, etc.), because individual hostnames might be too numerous to represent one by one.
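A rough sketch of the top-level-domain idea in plain Python. The helper top_level_domain is a hypothetical name, the host "cs.example.edu" is made up to show a second category, and taking the last dot-separated label is a naive assumption that ignores multi-part suffixes such as .co.uk.

```python
def top_level_domain(host):
    """Return the last dot-separated label of a hostname (naive TLD guess)."""
    return host.rsplit(".", 1)[-1].lower()

# First two hosts come from the question; the third is a made-up example.
hosts = ["ex1.com", "ex2.com", "cs.example.edu"]

tlds = [top_level_domain(h) for h in hosts]

# Fixed, sorted vocabulary so every sample uses the same feature order.
vocabulary = sorted(set(tlds))

# One indicator feature per TLD in the vocabulary.
one_hot = [[1.0 if tld == v else 0.0 for v in vocabulary] for tld in tlds]

print(vocabulary)
print(one_hot)
```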

You might also consider clustering hostnames into categories of hosts based on some database (if such a thing exists), and using the cluster that each hostname belongs to as a categorical feature.

For orig_h, you might consider grouping IPs by region and defining a categorical feature per region.
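Mapping an IP to a geographic region needs an external geolocation database, which is not shown here. As a stand-in, the sketch below groups addresses by their /24 network using the standard-library ipaddress module; the helper subnet_bucket, the prefix length, and the address "10.0.0.7" are assumptions made for illustration only.

```python
import ipaddress

def subnet_bucket(ip, prefix_len=24):
    """Return the network an address belongs to, as a string category."""
    # strict=False lets us pass a host address and have the host bits masked off.
    network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
    return str(network)

# First two addresses come from the question; the third is a made-up example.
addresses = ["192.168.1.4", "192.168.1.3", "10.0.0.7"]

buckets = [subnet_bucket(a) for a in addresses]

print(buckets)
```

The resulting bucket strings can then be encoded as indicator features in the same way as any other categorical column.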

uid looks to be unique per user, so it carries no generalizable information and you might leave that column out.

You will need to think this through for each field. Start with the documentation I linked to. In general, though, this is a standard data mining question, and any good book on data mining will be invaluable for understanding these concepts further. Here is one that is easy to find online via a Google search:

https://books.google.com/books/about/Data_Mining_Concepts_and_Techniques.html?id=pQws07tdpjoC&printsec=frontcover&source=kp_read_button#v=onepage&q&f=false

I will also include the following reference because it provides the best tutorials I have seen, hands down, and its introduction to ML section has a set of articles that are well worth reading. It is slightly tangential to the question, but I expect it will be useful.

https://github.com/aymericdamien/TensorFlow-Examples
