Compare each row of a matrix in Matlab with the remaining ones

I have a matrix A in Matlab of zeros and ones with dimension BxM.

Specifically, A contains all the possible dispositions of ones and zeros in M spaces considering also the order (hence, B=2^M).

The matrix is built such that, for any i=1,...,N, A(i,:)>A(j,:) or A(i,:)><A(j,:) for j=i+1,...,N.
Writing A(i,:)>=A(j,:) means that A(i,h)>=A(j,h) for h=1,...,M.
Writing A(i,:)>A(j,:) means that A(i,h)>=A(j,h) for h=1,...,M, with strict inequality for at least one h.
Writing A(i,:)><A(j,:) means that it is not possible to establish whether A(i,:)>=A(j,:) or viceversa.

For example when M=3

  A=[1 1 1;      1 1 0;      1 0 1;      1 0 0;      0 1 1;      0 1 0;      0 0 1;      0 0 0];

Consider

B=cell(2^M, 2^M);

For any two comparable rows of A, A(i,:)>A(j,:), I want B{i,j} containing all rows A(k,:) such that A(j,:)<A(k,:)<A(i,:).

In the example above, the desired output would be

B{1,4}=[1 1 0; 1 0 1]; B{1,6}=[1 1 0; 0 1 1]; B{1,7}=[1 0 1; 0 1 1]; B{1,8}=[1 1 0; 1 0 1; 1 0 0; 0 1 1; 0 1 0; 0 0 1]; B{2,8}=[1 0 0; 0 1 0]; B{3,8}=[1 0 0; 0 0 1]; B{5,8}=[0 0 1; 0 1 0];

This code does what I want

B=cell(2^M, 2^M);    for j=1:size(A,1)       for h=1:size(A,1)           if  sum(A(j,:)==A(h,:))~=M && sum(A(j,:)>=A(h,:))==M %if A(j,:)>A(h,:) according to the meaning indicated above               B{j,h}=A(any(bsxfun(@ne, A,A(j,:)),2) & any(bsxfun(@ne, A,A(h,:)),2) &...                all((bsxfun(@le, A,A(j,:)) &  bsxfun(@ge, A,A(h,:))),2),:);           end       end   end

However, the code above is not feasible because in my actual case M=20. Do you have suggestions on whether it is possible to speed it up and, if yes, how?

One of the main problems of the code is that, for M=20, it requires to preallocate a cell (2^20)x(2^20) which, clearly, cannot be done.

On the other hand, at the end of the double loop, lots of cells are empty (because the if condition is not satisfied by many pairs of rows), and what I really need is to keep tracks of the content and of the coordinates only of the non-empty ones. Hence, maybe, a "sparse cell" could help me here.

Any suggestion would be really appreciated.

4 Answers

Answers 1

Instead of pre-computing and memorizing the A matrix ("Old Matlab Style"),

write a "generator function" that gives you the row contents of the A-matrix for any row index. This way you convert a memory-intensive problem into a CPU-intensive problem. On top of that the subresults are independent on each other.

So what you need is

A_by_row=@(rowIdx)(....)

This way there is no need for lots of RAM, and you can distribute your problem to parfor, GPU or even multiple nodes, and then combine the subresults from sub-ranges.

Try this:

ListIdx=0;  for j=1:B   for h=1:B    if sum(A_by_row(j)==A_by_row(h))~=M,      ListIdx=ListIdx+1,      B_List(ListIdx).Coordinates=[j,h],          B_List(ListIdx).Result=**YourCodeThatMakesAnArbitraryLengthVec‌tor**,    end,   end,  end

This way, in a structure 'list' you will get, for each entry, a coordinate and the vector of answers.

Good luck.

Answers 2

Rather than brute force the answer, we can do a bit more math, to predict the outcome of the test, then just save the prediction.

We will make the following assumptions:

1) In array A, index location 'n' stores the binary representation of 'n'. Ex: A(3) = [0,1,1]

2) Because A(j) must not equal A(h), nor may the binary representation of A(j) be always less than A(h), the results must be lower triangular.

3) After plotting the locations of results for M=5, we see that there is a fractal pattern created by the compliance conditions. This fractal can be recreated using the Lucas Correspondence Theorem, "AND(NOT(j),h)=binomial_coefficient(j,h) mod(2)"

4) The following functions replicate your results.

function out = user3285148()     M = 5;     maxm = 2^M-1;     for j = 0:maxm         for h = 0:maxm;             if predict(j,h)                 B{j+1,h+1} = [binaryarray(j,M);binaryarray(h,M)];             end         end     end     out  = B; end  function out = binaryarray(in, length)     out = zeros(1,length);     bit = 0;     while in>0;         out(end-bit)=mod(in,2);         in = floor(in/2);         bit =bit+1;     end end  function out = predict(j,h)     out =0;     if h<j         out = mod(nchoosek(j,h),2);     end end

You would be advised to either store B in a compact form, or use the predictions to pull the data as needed in leu of storing (2^20)^2 cells of information.

Please let me know if this meets your requirements.

Answers 3

I'm gonna divide your problem in 3.

Find the size needed for B
Find all the pairs of comparable rows in A with at least one row in between (left hand part of B)
Find all the rows of A in between each comparable pair (right hand part of B)

Couple of concepts I will be using

Hamming weight: Number of 1's in a row
Hamming distance: Number of different values between rows

Part 1

We know that rows i and j are comparable and hold i>j if and only if any 0 in i is also a zero in j, and any 1 in j is also a 1 in i. Meaning that the hamming weight of i is always gonna be bigger than the hamming weight of j

We also know that if the hamming distance between i and j is one, then the set of rows in between is going to be empty.

So for any i we can find its viable j's by flipping at least two of its 1's into 0's.

Since we don't care yet about order, any i with the same hamming weight will have the same amount of viable j's. And if the hamming weight of i is one or zero, then there are no viable j's, because there are no two 1's to flip.

Now, we can find the number of viable j's for a given hamming weight of i w, that are at a given hamming distance d, using the function nchoosek(w,d). We know that the hamming weight of useful i's is from 2 to M and that the useful hamming distances are from 2 to each w.

So we can generate a matrix n_j of size MxM in which rows will indicate hamming weight and columns hamming distance.

n_j = zeros(M);  for w=2:20     for d=2:w         n_j(w,d) = nchoosek(w,d);     end end

If we sum each row, we will have the number of available j's for each i with a given hamming weight.

We can also find how many i's with a given useful hamming weight there are in A, since it's complete and has every combination possible. We can use again nchoosek

n_i = zeros(M,1);  for w=2:M     n_i(w) = nchoosek(M,w); end

Now we can calculate the total number of pairs ij by multiplying the number of possible i's on each hamming weight by the number of possible j's on each hamming weight.

sum(n_i .* sum(n_j,2))

Which is huge in the case M=20, it's 3.4753 e09. And that's just the number of pairs, we still need to find the number of rows in between each pair. Its still an improvement from the 2^20 x 2^20, but I would be considering my options at this point.

Continuing, we can find the number of rows in between pairs as a function of the hamming distance of each pair. Basically its 2^d -2 it comes from the fact that hamming distance indicates you the number of values that you can flip to obtain a new valid row, minus the two original rows i and j.

And since we kept the original n_j separated also by hamming weights, we can multiply all together.

n_r = 2.^(1:M) - 2;  sum(sum((n_i*n_r).*n_j))

Which for the case of M=20 is 1.0925 e12. Meaning that B will have to be of length 3.4753 e09 and hold 1.0925 e12 values, which is in the order of I wouldn't try it

Part 2

Now to generate all the pairs ij. I believe that it is going to be easier and faster to generate a function to which we input i and it gives back the valid j's.

You say you already have A, so you can directly index it, but I will use the de2bi and bi2de functions to relate index i to each row A_row_i

A_row_i = de2bi(2^M - i, M)  i = 2^M - bi2de(A_row_i)

First thing is to calculate the hamming weight of the row i, and use it obtain how many j's are we expecting.

w = sum(A_row_i);  j_s = n_j(w,:);

j_s is going to be a vector of size M that gives us how many j's we can get at a given hamming distance.

One can modify code from this answer to create a matrix flips of size sum(j_s)xw with all the necessary flipping of 1's in A_row_i to obtain the matrix of A_row_j

flips=[];  for d=2:M     c = nchoosek(1:w,d);     out = ones(j_s(d),w);     out(sub2ind([j_s(d),w],(1:j_s(d))'*ones(1,d),c))=0;     flips = [flips;out]; end

It's worth noting at this point that flips depend only on w, not on i, so there are only M-2 different ones. In theory you could precompute them to speed up code.

Now that we have the desired flips of 1's we just need to apply them to a copy of A_row_i and recover the values of j

A_row_j = repmat(A_row_i,[sum(j_s),1]); A_row_j(A_row_j==1) = flips;  j = 2^M - bi2de(A_row_j);

Now j contains all the valid j's for the respective i.

You can now loop over all values from 1 to 2^M and build B ongoing, it's going to be an improvement over the double loop, but still for M=20, its going to be a while.

Part 3

If we have valid i and j, getting the indices k of in between rows is not that difficult. First we calculate the hamming distance d between A_row_i and A_row_j, and use this to create a matrix similar to flips that will help us build all the values of A_row_k and recover index k

The values to flip are those in which A_row_i and A_row_j differ from each other, and we can find these values easily with bitxor

A_row_i = de2bi(2^M - i, M); A_row_j = de2bi(2^M - j, M);  d = pdist([A_row_i ; A_row_j],'cityblock');  flips = de2bi(1:2^d-2);  A_row_k = repmat(A_row_i,[n_r(d),1]);  flipable = repmat(bitxor(A_row_i,A_row_j),[n_r(d),1]);  A_row_k(flipable==1) = flips;  k = 2^M - bi2de(A_row_k);

Now you could loop over B, or run this code every time you get the j's for every i, whatever works better for you.

As conclusion I would really recommend leave this as generators and call them whenever you need values, because for M=20, the amount of memory needed is impractical.

Answers 4

easy math problem !

Function out=A(row , number_of_bits)         out=~bitget(row-1,number_of_bits:-1:1); end  A(3,3) ans =      1     0     1 %which is identical with your A(3,:)

function [out]=B(n1,n2,number_of_bits)    c1=A(n1,number_of_bits);    c2=A(n2,number_of_bits);    out=zeros(1,number_of_bits);    out((c1==0)&(c2==1))=-1;    if sum(out)>=0       out=zeros(1,number_of_bits)+2;       out(c1==0)=0;       out(c2==1)=1;        out=int8(out);    else       out=[];    end  %example: B(1,4,3) ans =      1     2     2 %number 2 serves as wildcard ("*") and means that element could be either 0 or 1, you should discard A(1,:) and A(4,:) from the results so the actual size_of_B is (2^ number_of_wildcards)-2

example:

B(1,1000000,20)  ans =   Columns 1 through 10      2     2     2     2     1     2     1     1     1     1   Columns 11 through 20      2     1     1     1     2     2     2     2     2     2   %which means that B(1,1000000) has (2^12)-2=4094 rows   B(1,2,20) ans =   Columns 1 through 10      1     1     1     1     1     1     1     1     1     1   Columns 11 through 20      1     1     1     1     1     1     1     1     1     2   %which means that B(1,2) has (2^1)-2=0 rows

size_of_B:

function out=size_of_B(b)      if numel(b)==0         out=0;      else         b(~(b==2))=0;         b(~(b==0))=1;         out=(2^sum(b))-2;      end %example: size_of_B(B(1,1000000,20)) ans =         4094

there is (2^19)*((2^20)-1)~= 5e11 pair of i, j for iteration if you had a for loop which done nothing it gets about : 1100 seconds! so the amount of time for calculating all non-empty-B is huge(even if magically we know which element of B is non-empty before any calculation)

tic;   for i=1:1:10^9   end   toc Elapsed time is 2.218067 seconds. tic;   for i=1:1:10^10  end   toc Elapsed time is 22.165445 seconds.

%example:  tic;  M=10;  n=1;  for i=1:2^M      for j=1:2^M          if i<j              c=B(i,j,M);              if ~(size_of_B(c)==0)                  d(n,:)=c;                  n=n+1;              end          end      end  end  toc  [s,~]=size(d);  s Elapsed time is 10.095085 seconds.  s =           52905 %for M=10 there is 52905 pair of i,j with non-empty-B values;

for M=20 it takes about 10*4^10~=10M seconds ~= 121 days! (if the memory allows)

%script for plotting number of non-empty-B versus M   d=[];  for M=1:10      tic;      n=1;      for i=1:2^M          for j=1:2^M              if i<j                  c=B(i,j,M);                  if ~(size_of_B(c)==0)                      d(n,:)=c;                      n=n+1;                  end              end          end      end      toc      [s,~]=size(d);      d=[];      h(M)=s;  end Elapsed time is 0.005418 seconds. Elapsed time is 0.002238 seconds. Elapsed time is 0.000933 seconds. Elapsed time is 0.003058 seconds. Elapsed time is 0.011650 seconds. Elapsed time is 0.044217 seconds. Elapsed time is 0.170286 seconds. Elapsed time is 0.696161 seconds. Elapsed time is 3.024212 seconds. Elapsed time is 15.836280 seconds. >> plot(h,'-o')

and estimating from the plot we could find that for M=20 there is about ~3e9 non-empty-B values; (and at least needs about 20~60 GB of memory to store B values like wildcard style which i used and much much more if you want to write all of B without wildcard style)

finally i could mention that exact number of non-empty-B could be calculated with an easy algorithm !

Coding Question

Monday, April 3, 2017

Compare each row of a matrix in Matlab with the remaining ones

4 Answers

Answers 1

Answers 2

Answers 3

Answers 4

If You Enjoyed This, Take 5 Seconds To Share It

0 comments:

Post a Comment

Search

Popular Posts

Labels

Blog Archive

Find Us On Facebook