Showing posts with label algorithm. Show all posts
Showing posts with label algorithm. Show all posts

Friday, September 21, 2018

Detect differences between two strings

Leave a Comment

I have 2 strings

string a = "foo bar"; string b = "bar foo"; 

and I want to detect the changes from a to b. What characters do I have to change, to get from a to b?

I think there must be a iteration over each character and detect if it was added, removed or remained equal. So this is my exprected result

'f' Remove 'o' Remove 'o' Remove ' ' Remove 'b' Equal 'a' Equal 'r' Equal ' ' Add 'f' Add 'o' Add 'o' Add 

class and enum for the result:

public enum Operation { Add,Equal,Remove }; public class Difference {     public Operation op { get; set; }     public char c { get; set; } } 

Here is my solution but the "Remove" case is not clear to me how the code has to look like

public static List<Difference> CalculateDifferences(string left, string right) {     int count = 0;     List<Difference> result = new List<Difference>();     foreach (char ch in left)     {         int index = right.IndexOf(ch, count);         if (index == count)         {             count++;             result.Add(new Difference() { c = ch, op = Operation.Equal });         }         else if (index > count)         {             string add = right.Substring(count, index - count);             result.AddRange(add.Select(x => new Difference() { c = x, op = Operation.Add }));             count += add.Length;         }         else         {             //Remove?         }     }     return result; } 

How does the code have to look like for removed characters?


Update - added a few more examples

example 1:

string a = "foobar"; string b = "fooar"; 

expected result:

'f' Equal 'o' Equal 'o' Equal 'b' Remove 'a' Equal 'r' Equal 

example 2:

string a = "asdfghjk"; string b = "wsedrftr"; 

expected result:

'a' Remove 'w' Add 's' Equal 'e' Add 'd' Equal 'r' Add 'f' Equal 'g' Remove 'h' Remove 'j' Remove 'k' Remove 't' Add 'r' Add 

Update:

Here is a comparison between Dmitry's and ingen's answer: https://dotnetfiddle.net/MJQDAO

5 Answers

Answers 1

You are looking for (minimum) edit distance / (minimum) edit sequence. You can find the theory of the process here:

https://web.stanford.edu/class/cs124/lec/med.pdf

Let's implement (simplest) Levenstein Distance / Sequence algorithm (for details see https://en.wikipedia.org/wiki/Levenshtein_distance). Let's start from helper classes (I've changed a bit your implementation of them):

  public enum EditOperationKind : byte {     None,    // Nothing to do     Add,     // Add new character     Edit,    // Edit character into character (including char into itself)     Remove,  // Delete existing character   };    public struct EditOperation {     public EditOperation(char valueFrom, char valueTo, EditOperationKind operation) {       ValueFrom = valueFrom;       ValueTo = valueTo;        Operation = valueFrom == valueTo ? EditOperationKind.None : operation;     }      public char ValueFrom { get; }     public char ValueTo {get ;}     public EditOperationKind Operation { get; }      public override string ToString() {       switch (Operation) {         case EditOperationKind.None:           return $"'{ValueTo}' Equal";         case EditOperationKind.Add:           return $"'{ValueTo}' Add";         case EditOperationKind.Remove:           return $"'{ValueFrom}' Remove";         case EditOperationKind.Edit:           return $"'{ValueFrom}' to '{ValueTo}' Edit";         default:           return "???";       }     }   } 

As far as I can see from the examples provided we don't have any edit operation, but add + remove; that's why I've put editCost = 2 when insertCost = 1, int removeCost = 1 (in case of tie: insert + remove vs. edit we put insert + remove). Now we are ready to implement Levenstein algorithm:

public static EditOperation[] EditSequence(   string source, string target,    int insertCost = 1, int removeCost = 1, int editCost = 2) {    if (null == source)     throw new ArgumentNullException("source");   else if (null == target)     throw new ArgumentNullException("target");    // Forward: building score matrix    // Best operation (among insert, update, delete) to perform    EditOperationKind[][] M = Enumerable     .Range(0, source.Length + 1)     .Select(line => new EditOperationKind[target.Length + 1])     .ToArray();    // Minimum cost so far   int[][] D = Enumerable     .Range(0, source.Length + 1)     .Select(line => new int[target.Length + 1])     .ToArray();    // Edge: all removes   for (int i = 1; i <= source.Length; ++i) {     M[i][0] = EditOperationKind.Remove;     D[i][0] = removeCost * i;   }    // Edge: all inserts    for (int i = 1; i <= target.Length; ++i) {     M[0][i] = EditOperationKind.Add;     D[0][i] = insertCost * i;   }    // Having fit N - 1, K - 1 characters let's fit N, K   for (int i = 1; i <= source.Length; ++i)     for (int j = 1; j <= target.Length; ++j) {       // here we choose the operation with the least cost       int insert = D[i][j - 1] + insertCost;       int delete = D[i - 1][j] + removeCost;       int edit = D[i - 1][j - 1] + (source[i - 1] == target[j - 1] ? 0 : editCost);        int min = Math.Min(Math.Min(insert, delete), edit);        if (min == insert)          M[i][j] = EditOperationKind.Add;       else if (min == delete)         M[i][j] = EditOperationKind.Remove;       else if (min == edit)         M[i][j] = EditOperationKind.Edit;        D[i][j] = min;     }    // Backward: knowing scores (D) and actions (M) let's building edit sequence   List<EditOperation> result =      new List<EditOperation>(source.Length + target.Length);    for (int x = target.Length, y = source.Length; (x > 0) || (y > 0);) {     EditOperationKind op = M[y][x];      if (op == EditOperationKind.Add) {       x -= 1;       result.Add(new EditOperation('\0', target[x], op));     }     else if (op == EditOperationKind.Remove) {       y -= 1;       result.Add(new EditOperation(source[y], '\0', op));     }     else if (op == EditOperationKind.Edit) {       x -= 1;       y -= 1;       result.Add(new EditOperation(source[y], target[x], op));     }     else // Start of the matching (EditOperationKind.None)       break;   }    result.Reverse();    return result.ToArray(); } 

Demo:

var sequence = EditSequence("asdfghjk", "wsedrftr");   Console.Write(string.Join(Environment.NewLine, sequence)); 

Outcome:

'a' Remove 'w' Add 's' Equal 'e' Add 'd' Equal 'r' Add 'f' Equal 'g' Remove 'h' Remove 'j' Remove 'k' Remove 't' Add 'r' Add 

Answers 2

I'll go out on a limb here and provide an algorithm that's not the most efficient, but is easy to reason about.

Let's cover some ground first:

1) Order matters

string before = "bar foo" string after = "foo bar" 

Even though "bar" and "foo" occur in both strings, "bar" will need to be removed and added again later. This also tells us it's the after string that gives us the order of chars we're interested in, we want "foo" first.

2) Order over count

Another way to look at it, is that some chars may never get their turn.

string before = "abracadabra" string after = "bar bar" 

Only the bold chars of "bar bar", get their say in "abracadabra". Even though we've got two b's in both strings, only the first one counts. By the time we get to the second b in "bar bar" the second b in "abracadabra" has already been passed, when we were looking for the first occurrence of 'r'.

3) Barriers

Barriers are the chars that exist in both strings, taking order and count into consideration. This already suggests a set might not be the most appropriate data structure, as we would lose count.

For an input

string before = "pinata" string after = "accidental" 

We get (pseudocode)

var barriers = { 'a', 't', 'a' } 

"pinata"

"accidental"

Let's follow the execution flow:

  • 'a' is the first barrier, it's also the first char of after so everything prepending the first 'a' in before can be removed. "pinata" -> "ata"
  • the second barrier is 't', it's not at the next position in our after string, so we can insert everything in between. "ata" -> "accidenta"
  • the third barrier 'a' is already at the next position, so we can move to the next barrier without doing any real work.
  • there are no more barriers, but our string length is still less than that of after, so there will be some post processing. "accidenta" -> "accidental"

Note 'i' and 'n' don't get to play, again, order over count.


Implementation

We've established that order and count matter, a Queue comes to mind.

static public List<Difference> CalculateDifferences(string before, string after) {     List<Difference> result = new List<Difference>();     Queue<char> barriers = new Queue<char>();      #region Preprocessing     int index = 0;     for (int i = 0; i < after.Length; i++)     {         // Look for the first match starting at index         int match = before.IndexOf(after[i], index);         if (match != -1)         {             barriers.Enqueue(after[i]);             index = match + 1;         }     }     #endregion      #region Queue Processing     index = 0;     while (barriers.Any())     {         char barrier = barriers.Dequeue();         // Get the offset to the barrier in both strings,          // ignoring the part that's already been handled         int offsetBefore = before.IndexOf(barrier, index) - index;         int offsetAfter = after.IndexOf(barrier, index) - index;         // Remove prefix from 'before' string         if (offsetBefore > 0)         {             RemoveChars(before.Substring(index, offsetBefore), result);             before = before.Substring(offsetBefore);         }         // Insert prefix from 'after' string         if (offsetAfter > 0)         {             string substring = after.Substring(index, offsetAfter);             AddChars(substring, result);             before = before.Insert(index, substring);             index += substring.Length;         }         // Jump over the barrier         KeepChar(barrier, result);         index++;     }     #endregion      #region Post Queue processing     if (index < before.Length)     {         RemoveChars(before.Substring(index), result);     }     if (index < after.Length)     {         AddChars(after.Substring(index), result);     }     #endregion      return result; }  static private void KeepChar(char barrier, List<Difference> result) {     result.Add(new Difference()     {         c = barrier,         op = Operation.Equal     }); }  static private void AddChars(string substring, List<Difference> result) {     result.AddRange(substring.Select(x => new Difference()     {         c = x,         op = Operation.Add     })); }  static private void RemoveChars(string substring, List<Difference> result) {     result.AddRange(substring.Select(x => new Difference()     {         c = x,         op = Operation.Remove     })); } 

Answers 3

I tested with 3 examples above, and it returns the expected result properly and perfectly.

        int flag = 0;         int flag_2 = 0;          string a = "asdfghjk";         string b = "wsedrftr";          char[] array_a = a.ToCharArray();         char[] array_b = b.ToCharArray();          for (int i = 0,j = 0, n= 0; i < array_b.Count(); i++)         {                //Execute 1 time until reach first equal character                if(i == 0 && a.Contains(array_b[0]))             {                 while (array_a[n] != array_b[0])                 {                     Console.WriteLine(String.Concat(array_a[n], " : Remove"));                     n++;                 }                 Console.WriteLine(String.Concat(array_a[n], " : Equal"));                 n++;             }             else if(i == 0 && !a.Contains(array_b[0]))             {                 Console.WriteLine(String.Concat(array_a[n], " : Remove"));                 n++;                 Console.WriteLine(String.Concat(array_b[0], " : Add"));             }               else             {                 if(n < array_a.Count())                 {                     if (array_a[n] == array_b[i])                     {                         Console.WriteLine(String.Concat(array_a[n], " : Equal"));                         n++;                     }                     else                     {                         flag = 0;                         for (int z = n; z < array_a.Count(); z++)                         {                                                           if (array_a[z] == array_b[i])                             {                                 flag = 1;                                 break;                             }                                                                                       }                          if (flag == 0)                         {                             flag_2 = 0;                             for (int aa = i; aa < array_b.Count(); aa++)                             {                                 for(int bb = n; bb < array_a.Count(); bb++)                                 {                                     if (array_b[aa] == array_a[bb])                                     {                                         flag_2 = 1;                                         break;                                     }                                 }                             }                              if(flag_2 == 1)                             {                                 Console.WriteLine(String.Concat(array_b[i], " : Add"));                             }                             else                             {                                 for (int z = n; z < array_a.Count(); z++)                                 {                                     Console.WriteLine(String.Concat(array_a[z], " : Remove"));                                     n++;                                 }                                  Console.WriteLine(String.Concat(array_b[i], " : Add"));                             }                          }                         else                         {                             Console.WriteLine(String.Concat(array_a[n], " : Remove"));                             i--;                             n++;                         }                      }                 }                 else                 {                     Console.WriteLine(String.Concat(array_b[i], " : Add"));                 }              }          }//end for           MessageBox.Show("Done");       //OUTPUT CONSOLE:     /*     a : Remove     w : Add     s : Equal     e : Add     d : Equal     r : Add     f : Equal     g : Remove     h : Remove     j : Remove     k : Remove     t : Add     r : Add     */   

Answers 4

Here might be another solution, full code and commented. However the result of your first original example is inverted :

class Program {     enum CharState     {         Add,         Equal,         Remove     }      struct CharResult     {         public char c;         public CharState state;     }      static void Main(string[] args)     {         string a = "asdfghjk";         string b = "wsedrftr";         while (true)         {             Console.WriteLine("Enter string a (enter to quit) :");             a = Console.ReadLine();             if (a == string.Empty)                 break;             Console.WriteLine("Enter string b :");             b = Console.ReadLine();              List<CharResult> result = calculate(a, b);             DisplayResults(result);         }         Console.WriteLine("Press a key to exit");         Console.ReadLine();     }      static List<CharResult> calculate(string a, string b)     {         List<CharResult> res = new List<CharResult>();         int i = 0, j = 0;          char[] array_a = a.ToCharArray();         char[] array_b = b.ToCharArray();          while (i < array_a.Length && j < array_b.Length)         {             //For the current char in a, we check for the equal in b             int index = b.IndexOf(array_a[i], j);             if (index < 0) //not found, this char should be removed             {                 res.Add(new CharResult() { c = array_a[i], state = CharState.Remove });                 i++;             }             else             {                 //we add all the chars between B's current index and the index                 while (j < index)                 {                     res.Add(new CharResult() { c = array_b[j], state = CharState.Add });                     j++;                 }                 //then we say the current is the same                 res.Add(new CharResult() { c = array_a[i], state = CharState.Equal });                 i++;                 j++;             }         }          while (i < array_a.Length)         {             //b is now empty, we remove the remains             res.Add(new CharResult() { c = array_a[i], state = CharState.Remove });             i++;         }         while (j < array_b.Length)         {             //a has been treated, we add the remains             res.Add(new CharResult() { c = array_b[j], state = CharState.Add });             j++;         }          return res;     }      static void DisplayResults(List<CharResult> results)     {         foreach (CharResult r in results)         {             Console.WriteLine($"'{r.c}' - {r.state}");         }     } } 

Answers 5

If you want to have a precise comparison between two strings, you must read and understand Levenshtein Distance. by using this algorithm you can precisely calculate rate of similarity between two string and also you can backtrack the algorithm to get the chain of changing on the second string. this algorithm is a important metric for Natural Language Processing also.

there are some other benefits and it's need time to learn.

in this link there is a C# version of Levenshtein Distance :

https://www.dotnetperls.com/levenshtein

Read More

Monday, August 13, 2018

Curious memory consumption of pandas.unique()

Leave a Comment

While profiling the memory consumption of my algorithm, I was surprised that sometimes for smaller inputs more memory was needed.

It all boils down to the following usage of pandas.unique():

import numpy as np import pandas as pd import sys  N=int(sys.argv[1])  a=np.arange(N, dtype=np.int64) b=pd.unique(a) 

with N=6*10^7 it needs 3.7GB peak memory, but with N=8*10^7 "only" 3GB.

Scanning different input-size yields the following graph:

enter image description here

Out of curiosity and for self-education: How can the counterintuitive behavior (i.e. more memory for smaller input size) around N=5*10^7, N=1.3*10^7 be explained?


Here are the scripts for producing the memory consumption graph on Linux:

pandas_unique_test.py:

import numpy as np import pandas as pd import sys  N=int(sys.argv[1])     a=np.arange(N, dtype=np.int64) b=pd.unique(a) 

show_memory.py:

import sys import matplotlib.pyplot as plt    ns=[] mems=[] for line in sys.stdin.readlines():     n,mem = map(int, line.strip().split(" "))     ns.append(n)     mems.append(mem) plt.plot(ns, mems, label='peak-memory') plt.xlabel('n') plt.ylabel('peak memory in KB') ymin, ymax = plt.ylim() plt.ylim(0,ymax) plt.legend() plt.show() 

run_perf_test.sh:

WRAPPER="/usr/bin/time -f%M" #peak memory in Kb N=1000000 while [ $N -lt 100000000 ] do    printf "$N "    $WRAPPER python pandas_unique_test.py $N    N=`expr $N + 1000000` done  

And now:

sh run_perf_tests.sh  2>&1 | python show_memory.py 

1 Answers

Answers 1

Let's see...

pandas.unique says it's a "hash-table based unique".

It calls this function to acquire the correct hash table implementation for your data, namely htable.Int64HashTable.

The hash table is initialized with size_hint = the length of your value vector. That means kh_resize_DTYPE(table, size_hint) gets called.

Those functions are defined (templated) here in khash.h.

It seems to allocate (size_hint >> 5) * 4 + (size_hint) * 8 * 2 bytes of memory for the buckets (maybe more, maybe less, I might be off here).

Then, HashTable.unique() is called.

It allocates an empty Int64Vector, which seem to quadruple their size whenever they get filled, starting from 128.

It then iterates over your values, figuring out whether they're in the hash table; if not, they get added to both the hash table and the vector. (This is where the vector may grow; the hash table shouldn't need to grow due to the size hint.)

Finally, a NumPy ndarray is made to point at the vector.

So uh, I think you're seeing the vector size quadrupling at certain thresholds (which should be, if my late-night math stands,

>>> [2 ** (2 * i - 1) for i in range(4, 20)] [     128,     512,     2048,     8192,     32768,     131072,     524288,     2097152,     8388608,     33554432,     134217728,     536870912,     2147483648,     8589934592,     34359738368,     137438953472,     ..., ] 

Hope this sheds some light at things :)

Read More

Thursday, August 9, 2018

Combining color hex blending algorithm with standard CMYK colored buttons

Leave a Comment

I am trying to blend any combination of colors using colored buttons which output a specific hex number in combination with a slider bar from bootstrap that allows the user to indicate the percentage of color they want to use.

I couldn't get the slider to properly run and I'm not sure why.

I do need some help combining the pieces of code into a working algorithm that I can use for my art class. The full code can be found here:

https://jsfiddle.net/mw7optL5/289/

//Hex blending algorithm var mix = function(color_1, color_2, weight) {   function d2h(d) { return d.toString(16); }  // convert a decimal value to hex   function h2d(h) { return parseInt(h, 16); } // convert a hex value to decimal     weight = (typeof(weight) !== 'undefined') ? weight : 50; // set the weight to 50%, if that argument is omitted    var color = "#";    for(var i = 0; i <= 5; i += 2) { // loop through each of the 3 hex pairs—red, green, and blue     var v1 = h2d(color_1.substr(i, 2)), // extract the current pairs         v2 = h2d(color_2.substr(i, 2)),          // combine the current pairs from each source color, according to the specified weight         val = d2h(Math.floor(v2 + (v1 - v2) * (weight / 100.0)));       while(val.length < 2) { val = '0' + val; } // prepend a '0' if val results in a single digit      color += val; // concatenate val to our new color string   }    return color; // PROFIT!   };  //buttons  <button class="Yellow" onclick="mix('#ffff00')">Yellow</button>  <button class="Magenta" onclick="mix('#D9017A')">Magenta</button>  <button class="Cyan" onclick="mix('#009FDF')">Cyan</button>  <button class="Black" onclick="mix('#2D2926')">Black</button>  <button class="Orange021" onclick="mix('#FE5000')">Orange</button>  //slider bar <input id="ex1" data-slider-id='ex1Slider' type="text" data-slider-min="0" data-slider-max="20" data-slider-step="1" data-slider-value="14"/>  var slider = new Slider('#ex1', {     formatter: function(value) {         return 'Current value: ' + value;     } }); 

1 Answers

Answers 1

This just solves the mechanics I think you want (run and save the slider value properly), and it could be used as starter point. This doesn't solve the formula of any combination of colors as you want.

  • Create an object based in the buttons with all percentages (0 default) like { Yellow: 0, Magenta: 0, Cyan: 0, Black: 0 }
  • Add an event listener to each button .click (jquery) or .addEventListener (pure js)
  • In the callback of each button, save the color and percentage into the object, like colorPercentages[colors[i].getAttribute("id")] = $("#ex1").val(); where $("#ex1").val() is the slider's value.
  • Draw the color based in some formula, like $("body").css('background-color', 'rgb(' + r + ', ' + g + ', ' + b + ')');

In this case I've used this formula:

Red = 255 x ( 1 - Cyan / 100 ) x ( 1 - Black / 100 )

Green = 255 x ( 1 - Magenta / 100 ) x ( 1 - Black / 100 )

Blue = 255 x ( 1 - Yellow / 100 ) x ( 1 - Black / 100 )

Here a full functional example based on @barbsan modifications: https://jsfiddle.net/5dsgbne9/

That example is using the color button click event, but you could use the inverse logic as well, changing the color when user moves the slider with something like:

$("#ex1").slider().on('slideStop', function(ev){     //code }); 

As I understand, a blending algorithm it's pretty difficult to achieve, this question and the answers give some good approaches: Mixing two colors "naturally" in javascript, but you still need to mix n colors, not just 2, which is even more complex. Fortunately it seems to be not impossible.

Read More

Tuesday, May 29, 2018

How can I properly write this shader function in JS?

Leave a Comment

What I want to happen:

For testing a game art style I thought of, I want to render a 3D world in pixel-art form. So for example, take a scene like this (but rendered with certain coloring / style so as to look good once pixelated):

Original Image

And make it look something like this:

Pixelated Image

By playing with different ways of styling the 3D source I think the pixelated output could look nice. Of course to get this effect one just sizes the image down to ~80p and upscales it to 1080p with nearest neighbor resampling. But it's more efficient to render straight to an 80p canvas to begin with and just do the upscaling.

This is not typically how one would use a shader, to resize a bitmap in nearest neighbor format, but the performance on it is better than any other way I've found to make such a conversion in real time.

My code:

My buffer for the bitmap is stored in row major, as r1, g1, b1, a1, r2, g2, b2, a2... and I'm using gpu.js which essentially converts this JS func into a shader. My goal is to take one bitmap and return one at larger scale with nearest-neighbor scaling, so each pixel becomes a 2x2 square or 3x3 and so on. Assume inputBuffer is a scaled fraction of size of the output determined by the setOutput method.

var pixelateMatrix = gpu.createKernel(function(inputBuffer, width, height, scale) {   var y = Math.floor((this.thread.x / (width[0] * 4)) / scale[0]);   var x = Math.floor((this.thread.x % (width[0] * 4)) / scale[0]);   var remainder = this.thread.x % 4;   return inputBuffer[(x * y) + remainder];  }).setOutput([width * height * 4]); 

JSFiddle

Keep in mind it's iterating over a new buffer of the full size output, so I have to find the correct coordinates that will exist in the smaller sourceBuffer based on the current index in the outputBuffer (index is exposed by the lib as this.thread.x).

What's happening instead:

This, instead of making a nearest neighbor upscale, is making a nice little rainbow (above is the small normal render, below is the result of the shader, and to the right you can see some debug logging with stats about the input and output buffers):

Result

What am I doing wrong?

Note: I asked a related question here, Is there a simpler (and still performant) way to upscale a canvas render with nearest neighbor resampling?

2 Answers

Answers 1

Update 1 - 25th May 2018

I was able to resolve most of the issues. There were quite a few

  1. The logic of transformation was wrong, also the data was coming flipped for some reason, so I flipped cols and rows to start from bottom right

    var pixelateMatrix = gpu.createKernel(function(inputBuffer, width, height, scale) { var size = width[0] * height[0] * 4; var current_index = Math.floor((size - this.thread.x)/4);  var row = Math.floor(current_index / (width[0] * scale[0]) ); var col = Math.floor((current_index % width[0])/scale[0]); var index_old = Math.floor(row * (width[0] / scale[0])) + width[0] - col; var remainder = this.thread.x % 4; return inputBuffer[index_old * 4 + remainder];  }).setOutput([width * height * 4]); 
  2. You were using width and height in floats, which I have changed to be calculated first and then scaled

    var smallWidth = Math.floor(window.innerWidth / scale); var smallHeight = Math.floor(window.innerHeight / scale);  var width = smallWidth * scale; var height = smallHeight * scale;   var rt = new THREE.WebGLRenderTarget(smallWidth, smallHeight); var frameBuffer = new Uint8Array(smallHeight * smallHeight * 4); var outputBuffer = new Uint8ClampedArray(width * height * 4); 
  3. The canvas size was set to while inner width and height, you need to set it to just the image width and height

    context = canvas.getContext('2d'); canvas.width = width; canvas.height = height; 

Below is the final JSFiddle for the same

https://jsfiddle.net/are5Lbw8/6/

Results:

Working Upscale

Final Code for reference

var container; var camera, scene, renderer; var mouseX = 0; var mouseY = 0; var scale = 4; var windowHalfX = window.innerWidth / 2; var windowHalfY = window.innerHeight / 2; var smallWidth = Math.floor(window.innerWidth / scale); var smallHeight = Math.floor(window.innerHeight / scale);  var width = smallWidth * scale; var height = smallHeight * scale;   var rt = new THREE.WebGLRenderTarget(smallWidth, smallHeight); var frameBuffer = new Uint8Array(smallHeight * smallHeight * 4); var outputBuffer = new Uint8ClampedArray(width * height * 4); var output; var divisor = 2; var divisorHalf = divisor / 2; var negativeDivisorHalf = -1 * divisorHalf; var canvas; var context; var gpu = new GPU();  var pixelateMatrix = gpu.createKernel(function(inputBuffer, width, height, scale) { /*   var y = Math.floor((this.thread.x / (width[0] * 4)) / scale[0]);   var x = Math.floor((this.thread.x % (width[0] * 4)) / scale[0]);   var remainder = this.thread.x % 4;   return inputBuffer[(x * y) + remainder];    */    var size = width[0] * height[0] * 4;    var current_index = Math.floor((size - this.thread.x)/4);     var row = Math.floor(current_index / (width[0] * scale[0]) );    var col = Math.floor((current_index % width[0])/scale[0]);    var index_old = Math.floor(row * (width[0] / scale[0])) + width[0] - col;    var remainder = this.thread.x % 4;    return inputBuffer[index_old * 4 + remainder];  }).setOutput([width * height * 4]); console.log(window.innerWidth); console.log(window.innerHeight); init(); animate();  function init() {   container = document.createElement('div');   document.body.appendChild(container);   canvas = document.createElement('canvas');   document.body.appendChild(canvas);   context = canvas.getContext('2d');   canvas.width = width;   canvas.height = height;   camera = new THREE.PerspectiveCamera(45, window.innerWidth / window.innerHeight, 1, 2000);   camera.position.z = 100;    // scene   scene = new THREE.Scene();   var ambient = new THREE.AmbientLight(0xbbbbbb);   scene.add(ambient);   var directionalLight = new THREE.DirectionalLight(0xdddddd);   directionalLight.position.set(0, 0, 1);   scene.add(directionalLight);    // texture   var manager = new THREE.LoadingManager();   manager.onProgress = function(item, loaded, total) {     console.log(item, loaded, total);   };   var texture = new THREE.Texture();   var onProgress = function(xhr) {     if (xhr.lengthComputable) {       var percentComplete = xhr.loaded / xhr.total * 100;       console.log(Math.round(percentComplete, 2) + '% downloaded');     }   };   var onError = function(xhr) {};   var imgLoader = new THREE.ImageLoader(manager);   imgLoader.load('https://i.imgur.com/P6158Su.jpg', function(image) {     texture.image = image;     texture.needsUpdate = true;   });    // model   var objLoader = new THREE.OBJLoader(manager);   objLoader.load('https://s3-us-west-2.amazonaws.com/s.cdpn.io/286022/Bulbasaur.obj', function(object) {     object.traverse(function(child) {       if (child instanceof THREE.Mesh) {         child.material.map = texture;       }     });     object.scale.x = 45;     object.scale.y = 45;     object.scale.z = 45;     object.rotation.y = 3;     object.position.y = -10.5;     scene.add(object);   }, onProgress, onError);   renderer = new THREE.WebGLRenderer({     alpha: true,     antialias: false   });   renderer.setPixelRatio(window.devicePixelRatio);   renderer.setSize(smallWidth, smallHeight);   container.appendChild(renderer.domElement);   renderer.context.webkitImageSmoothingEnabled = false;   renderer.context.mozImageSmoothingEnabled = false;   renderer.context.imageSmoothingEnabled = false;   document.addEventListener('mousemove', onDocumentMouseMove, false);   window.addEventListener('resize', onWindowResize, false); }  function onWindowResize() {   windowHalfX = (window.innerWidth / 2) / scale;   windowHalfY = (window.innerHeight / 2) / scale;   camera.aspect = (window.innerWidth / window.innerHeight) / scale;   camera.updateProjectionMatrix();   renderer.setSize(smallWidth, smallHeight); }  function onDocumentMouseMove(event) {    mouseX = (event.clientX - windowHalfX) / scale;   mouseY = (event.clientY - windowHalfY) / scale;  }  function animate() {   requestAnimationFrame(animate);   render(); }  var flag = 0;  function render() {   camera.position.x += (mouseX - camera.position.x) * .05;   camera.position.y += (-mouseY - camera.position.y) * .05;   camera.lookAt(scene.position);   renderer.render(scene, camera);   renderer.render(scene, camera, rt);   renderer.readRenderTargetPixels(rt, 0, 0, smallWidth, smallHeight, frameBuffer);   //console.time('gpu');   console.log(frameBuffer, [width], [height], [scale]);   var outputBufferRaw = pixelateMatrix(frameBuffer, [width], [height], [scale]);   //console.timeEnd('gpu');   if (flag < 15) {     console.log('source', frameBuffer);     console.log('output', outputBufferRaw);      var count = 0;     for (let i = 0; i < frameBuffer.length; i++) {       if (frameBuffer[i] != 0) {         count++;       }     }     console.log('source buffer length', frameBuffer.length)     console.log('source non zero', count);      var count = 0;     for (let i = 0; i < outputBufferRaw.length; i++) {       if (outputBufferRaw[i] != 0) {         count++;       }     }     console.log('output buffer length', outputBufferRaw.length)     console.log('output non zero', count);   }   outputBuffer = new Uint8ClampedArray(outputBufferRaw);   output = new ImageData(outputBuffer, width, height);   context.putImageData(output, 0, 0);   flag++; } 

Original Answer

I have gotten close but two issues are left

  1. The image is getting inverted
  2. Sometimes your inputBuffer size is not a multiple of 4, which causes it to misbehave.

Below is code I used

var pixelateMatrix = gpu.createKernel(function(inputBuffer, width, height, scale) {    var current_index = Math.floor(this.thread.x/4);     var row = Math.floor(current_index / (width[0] * scale[0]) );    var col = Math.floor((current_index % width[0])/scale[0]);    var index_old = Math.floor(row * (width[0] / scale[0])) + col;    var remainder = this.thread.x % 4;    return inputBuffer[index_old * 4 + remainder];  }).setOutput([width * height * 4]); 

Below is the JSFiddle

https://jsfiddle.net/are5Lbw8/

And below is the current output

Current State

Answers 2

I think the function should look like:

var pixelateMatrix = gpu.createKernel(function(inputBuffer, width, height, scale) {     var x = Math.floor((this.thread.x / (width[0] * 4)) / scale[0]);     var y = Math.floor((this.thread.x % (width[0] * 4)) / scale[0]);     var finalval = y * (Math.floor(width[0]/scale[0]) * 4) + (x * 4);     var remainder = this.thread.x % 4;     return inputBuffer[finalval + remainder]; }).setOutput([width * height * 4]); 

Basically, get x and y in similar manner as you, scale x and y, then converting back from (x,y) you multiple the y value by the new scaled width and add the x value. Not sure how you got x*y for that part of it.

Read More

Saturday, May 26, 2018

Optimal weights subset sum using backtracking

Leave a Comment

I'm trying to solve a problem. I have some weights. [2,7,20,70,200,700] After a given input, for example 1507, it should return the optimal combination of those weights. Which in this case would be [700,200,200,200,200,7]. Unfortunately my algorithm is returning [700, 700, 70, 20, 7, 2, 2, 2, 2, 2]. When I say optimal I mean that my algorithm should use the fewest number of weights as possible.

func solve(_ targetValue: Int, weights: inout [Int]) -> [Int] {     // The used weights to store     var usedWeights: [Int] = []     // The current total value for the calculation     var total = targetValue     // If target value is 0 add it to the array and just return it     if targetValue == 0 { usedWeights.append(0); return usedWeights }     // Loop through the weights     for weight in weights.reversed() {      while weight <= total {             total -= weight             usedWeights.append(weight)         }     }    // If still weights are not found call the function recursively again but remove the last element before     if total != 0 {         weights.removeLast()         return solve(targetValue, weights: &weights)     }     return usedWeights }  var newWeights: [Int] = [2,7,20,70,200,700] print(solve(1507, weights: &newWeights)) 

How can I fix this? What am I doing wrong? The important thing is to solve it using backtracking. I really appreciate your help.

1 Answers

Answers 1

Here is a possible recursive solution:

// Find the shortest combination of (possibly repeating) numbers in `values` // whose sum is exactly `target`, and whose count is less than `limit`. // Return `nil` if no such combination exist. // // `target` must be non-negative, and `values` an array of positive // numbers in decreasing order. // func solveHelper(target: Int, values: ArraySlice<Int>, limit: Int) -> [Int]? {     if target == 0 {         return [] // Target reached exactly.     }     guard let first = values.first else {         return nil // No values left, target cannot be reached.     }     if target/first >= limit {         return nil // Target cannot be reached with less than `limit` values.     }      var best: [Int]? = nil   // Best solution found so far     var bestCount = limit // Number of values in best solution      for n in stride(from: target/first, through: 0, by: -1) {         if let s = solveHelper(target: target - n * first, values: values.dropFirst(), limit: bestCount - n) {             best = s + repeatElement(first, count: n)             bestCount = s.count + n         }     }      return best }  // Find the shortest combination of (possibly repeating) values in `values` // whose sum is exactly `target`. Return `nil` if no such combination exist. // // `target` must be non-negative, and `values` an array of positive // numbers. // func solve(target: Int, values: [Int]) -> [Int]? {     return solveHelper(target: target, values: ArraySlice(values.sorted(by: >)), limit: Int.max) } 

Examples:

print(solve(target: 1507, values: [2, 7, 20, 70, 200, 700]) as Any) // Optional([7, 200, 200, 200, 200, 700])  print(solve(target: 1507, values: [20, 70, 200, 700]) as Any) // nil  print(solve(target: 6, values: [1, 3, 4]) as Any) // Optional([3, 3])  print(solve(target: 0, values: [1, 3, 4]) as Any) // Optional([]) 

Some explanations:

  • It is assumed that target is non-negative and that all values are positive.
  • solve sorts the array in descending order and passes it as an ArraySlice to the recursive helper function. This helps to avoid further copies of the element storage, when values.dropFirst() is passed to the recursive call.
  • solveHelper starts “greedy” with the maximal possible number of the first (i.e. largest) value, calls itself recursively for the remaining target sum and values, then repeats the process with less copies of the first value, keeping track of the shortest solution found so far.
  • In order to “prune” the recursion tree, a limit is passed to the recursive call. As an example, if 1507 = 700 + 200 + 200 + 200 + 200 + 7 has already been found then there is no need anymore to sum only values in [2, 7, 20, 70], that would only give longer solutions.
  • The function runs reasonably fast in my tests for the given array. For a larger number of possible values you probably need a more sophisticated algorithm, such as the dynamic programming approach described in Change-making problem.
Read More

Wednesday, May 23, 2018

Parse 'ul' and 'ol' tags

Leave a Comment

I have to handle deep nesting of ul, ol, and li tags. I need to give the same view as we are giving in the browser. I want to achieve the following example in a pdf file:

 text = " <body>     <ol>         <li>One</li>         <li>Two              <ol>                 <li>Inner One</li>                 <li>inner Two                      <ul>                         <li>hey                              <ol>                                 <li>hiiiiiiiii</li>                                 <li>why</li>                                 <li>hiiiiiiiii</li>                             </ol>                         </li>                         <li>aniket </li>                     </li>                 </ul>                 <li>sup </li>                 <li>there </li>             </ol>             <li>hey </li>             <li>Three</li>         </li>     </ol>     <ol>         <li>Introduction</li>         <ol>             <li>Introduction</li>         </ol>         <li>Description</li>         <li>Observation</li>         <li>Results</li>         <li>Summary</li>     </ol>     <ul>         <li>Introduction</li>         <li>Description              <ul>                 <li>Observation                      <ul>                         <li>Results                              <ul>                                 <li>Summary</li>                             </ul>                         </li>                     </ul>                 </li>             </ul>         </li>         <li>Overview</li>     </ul> </body>" 

I have to use prawn for my task. But prawn doesn't support HTML tags. So, I came up with a solution using nokogiri:. I am parsing and later removing the tags with gsub. The below solution I have written for a part of the above content but the problem is ul and ol can vary.

     RULES = {   ol: {     1 => ->(index) { "#{index + 1}. " },     2 => ->(index) { "#{}" },     3 => ->(index) { "#{}" },     4 => ->(index) { "#{}" }   },   ul: {     1 => ->(_) { "\u2022 " },     2 => ->(_) { "" },     3 => ->(_) { "" },     4 => ->(_) { "" },   } }  def ol_rule(group, deepness: 1)   group.search('> li').each_with_index do |item, i|     prefix = RULES[:ol][deepness].call(i)     item.prepend_child(prefix)     descend(item, deepness + 1)   end end  def ul_rule(group, deepness: 1)   group.search('> li').each_with_index do |item, i|     prefix = RULES[:ul][deepness].call(i)     item.prepend_child(prefix)     descend(item, deepness + 1)   end end  def descend(item, deepness)   item.search('> ol').each do |ol|     ol_rule(ol, deepness: deepness)   end   item.search('> ul').each do |ul|     ul_rule(ul, deepness: deepness)   end end  doc = Nokogiri::HTML.fragment(text)  doc.search('ol').each do |group|   ol_rule(group, deepness: 1) end  doc.search('ul').each do |group|   ul_rule(group, deepness: 1) end     puts doc.inner_text   1. One 2. Two  1. Inner One 2. inner Two  • hey  1. hiiiiiiiii 2. why 3. hiiiiiiiii   • aniket    3. sup  4. there   3. hey  4. Three    1. Introduction  1. Introduction  2. Description 3. Observation 4. Results 5. Summary    • Introduction • Description  • Observation  • Results  • Summary       • Overview 

Problem

1) What I want to achieve is how to handle space when working with ul and ol tags
2) How to handle deep nesting when li come inside ul or li come inside ol

2 Answers

Answers 1

I've come up with a solution that handles multiple identations with configurable numeration rules per level:

require 'nokogiri' ROMANS = %w[i ii iii iv v vi vii viii ix]  RULES = {   ol: {     1 => ->(index) { "#{index + 1}. " },     2 => ->(index) { "#{('a'..'z').to_a[index]}. " },     3 => ->(index) { "#{ROMANS.to_a[index]}. " },     4 => ->(index) { "#{ROMANS.to_a[index].upcase}. " }   },   ul: {     1 => ->(_) { "\u2022 " },     2 => ->(_) { "\u25E6 " },     3 => ->(_) { "* " },     4 => ->(_) { "- " },   } }  def ol_rule(group, deepness: 1)   group.search('> li').each_with_index do |item, i|     prefix = RULES[:ol][deepness].call(i)     item.prepend_child(prefix)     descend(item, deepness + 1)   end end  def ul_rule(group, deepness: 1)   group.search('> li').each_with_index do |item, i|     prefix = RULES[:ul][deepness].call(i)     item.prepend_child(prefix)     descend(item, deepness + 1)   end end  def descend(item, deepness)   item.search('> ol').each do |ol|     ol_rule(ol, deepness: deepness)   end   item.search('> ul').each do |ul|     ul_rule(ul, deepness: deepness)   end end  doc = Nokogiri::HTML.fragment(text)  doc.search('ol:root').each do |group|   binding.pry   ol_rule(group, deepness: 1) end  doc.search('ul:root').each do |group|   ul_rule(group, deepness: 1) end 

You can then remove the tags or use doc.inner_text depending on your environment.

Two caveats though:

  1. Your entry selector must be carefully selected. I used your snippet verbatim without root element, thus i had to use ul:root/ol:root. Maybe "body > ol" works for your situation too. Maybe selecting each ol/ul but than walking each and only find those, that have no list parent.
  2. Using your example verbatim, Nokogiri does not handle the last 2 list items of the first group ol very well ("hey", "Three") When parsing with nokogiri, thus elements already "left" their ol tree and got placed in the root tree.

Current Output:

  1. One   2. Two       a. Inner One       b. inner Two         ◦ hey         ◦ hey       3. hey       4. hey   hey   Three    1. Introduction     a. Introduction   2. Description   3. Observation   4. Results   5. Summary    • Introduction   • Description       ◦ Observation           * Results               - Summary   • Overview 

Answers 2

Whenever you are in a ol, li or ul element, you must recursively check for ol, li and ul. If there are none of them, return (what have been discovered as a substructure), if there are, call the same function on the new node and add its return value to the current structure.

You perform a different action on each node no matter where it is depending on its type and then the function automatically repackage everything.

Read More

Thursday, May 10, 2018

How to align “tracks” or modular objects in Unity ?

Leave a Comment

I'm developing a simple game, where user can place different but modular objects (for instance: tracks, road etc).

My question is: how to match and place different object when placed one near the other ?

My first approach is to create an hidden child object (a box) for each module objects, and put it in the border where is possible to place other object (see my image example), so i can use that coordinates (x,y,z) to align other object.

But i don't know if the best approach.

enter image description here

enter image description here

Thanks

2 Answers

Answers 1

Summary:

1.Define what is a "snapping point"

2.Define which is your threshold

3.Update new game object position

Little Explanation

1. So I suppose that you need a way to define which parts of the object are the "snapping points". Cause they can be clear in some examples, like a Cube, where the whole vertex could be snapping points, but it's hard to define that every vertex in amorphous objects.

A simple solution could be the one exposed by @PierreBaret, whic consists in define on your transform component which are the "snapping points". The other one is the one you propouse, creating empty game objects that will act as snapping points locations on the game object.

2.After having those snaped points, when you will drop your new gameObject, you need to define a threshold, as long as you don't want that every object snaps allways to the nearest game object.

3.So you define a minimum distance between snapping points, so if your snapping point is under that threshold, you will need to update it's position, to adjust to the the snapped point.

Visual Representation:

enter image description here

Note: The Threshold distance is showing just ONE of the 4 current threshold checks on the 4 vertex in the square, but this dark blue circle should be repilcate 3 more times, one for each green snapping point of the red square

Of course this method seems expensive, you can make some improvements like setting a first threshold between gameobjects, and if the gameObject is inside this threshold, then check snapping threshold distance.

Hope it helps!

Answers 2

Approach for arbitrary objects/models and deformable models.

[A] A physical approach would consider all the surfaces of the 2 objects, and you might need to check that objects don't overlap, using dot products between surfaces. That's a bit more expensive computing, but nothing nasty. If there is no match involved here, you'll be able to add matching features (see [B]). However, that's the only way to work with non predefined models or deformable models.

Approaches for matching simple and complex models

[B] Snapping points are a good thing but it's not sufficient alone. I think you need to make an object have:

  • a sparse representation (eg., complex oriented sphere to a cube),
  • and place key snapping points,
  • tagged by polarity or color, and eventually orientation (that's oriented snapping points); eg., in the case of rails, you'll want rails to snap {+} with {+} and forbid {+} with {-}. In the case of a more complex object, or when you have several orientations (eg., 2 faces of a surface, but only one is candidate for an pair of objects matching) you'll need more than 2 polarities, but 3 different ones per matching candidate surface or feature therefore the colors (or any enumeration). You need 3 different colors to make sure there is a unique 3D space configuration. You create something that is called in chemistry an enantiomer.
  • You can also use point pair features that describes the relative position and orientation of two oriented points, when an oriented surface is not appropriate.

References

Some are computer vision papers or book extracts, but they expose algorithms and concepts to achieve what I developed in my answer.

  1. Model Globally, Match Locally: Efficient and Robust 3D Object Recognition, Drost et al.

  2. 3D Models and Matching

Read More

Thursday, April 26, 2018

Is there a better way to guess possible unknown variables without brute force than I am doing? Machine learning?

Leave a Comment

I have a game with the following rules:

  • User is given fruit prices and has a chance to buy or sell items in their fruit basket every turn.

  • The user cannot make more than a 10% total change in their basket on a single turn.

  • Fruit prices change everyday and when multiplied by the quantities of items in the fruit basket, the total value of the basket changes relative to the fruit price changes everyday as well.
  • The program is only given the current price of all the fruits and the current value of the basket (current price of fruit * quantities for all items in the basket).
  • Based on these 2 inputs(all fruit prices and basket total value), the program tries to guess what items are in the basket.
  • Basket cannot hold more than 100 items but slots can be empty
  • The play can play several turns.

My goal is to accurately guess as computationally inexpensively as possible (read: no brute force) and scale if there are thousands of new fruits.

I am struggling to find an answer but in my mind it’s not hard. If I have the below table. I could study day 1 and get the following data:

Apple   1 Pears   2 Oranges 3  Basket Value = 217 

I can do a back of napkin calculation and assume, the weights in the basket are: 0 apple, 83 pears, and 17 Oranges equaling a basket value of 217.

The next day, the values of the fruits and basket changes. To (apple = 2, Pear 3, Oranges 5) with a basket value of 348. When I take my assumed weights above (0,83,17) I get a total value of 334 – not correct! Running this by my script, I see the closest match is 0 apples, 76 pears, 24 oranges which although does equal 348 when % change of factored in it’s a 38% change so it’s not possible!

I know I can completely brute force this but if I have 1000 fruits, it won’t scale. Not to jump on any band wagon but can something like a neural net quickly rule out the unlikely so I calculate large volumes of data? I think they has to be a more scalable/quicker way than pure brute force? Or is there any other type solution that could can get result?

Here is the raw data (remember program can only see prices and total basket value only):

enter image description here

Here's some brute force code (Thank you @paul Hankin for a cleaner example than mine):

def possibilities(value, prices):     for i in range(0, value+1, prices[0]):         for j in range(0, value+1-i, prices[1]):             k = value - i - j             if k % prices[2] == 0:                 yield i//prices[0], j//prices[1], k//prices[2]  def merge_totals(last, this, r):     ok = []     for t in this:         for l in last:             f = int(sum(l) * r)             if all(l[i] -f <= t[i] <= l[i] + f for i in range(len(l))):                 ok.append(t)                 break     return ok  days = [     (217, (1, 2, 3)),     (348, (2, 3, 5)),     (251, (1, 2, 4)), ]  ps = None for i, d in enumerate(days):     new_ps = list(possibilities(*d))     if ps is None:         ps = new_ps     ps = merge_totals(ps, new_ps, 0.10)      print('Day %d' % (i+1))     for p in ps:         print('Day %d,' % (i+1), 'apples: %s, pears: %s, oranges: %s' % p)     print 

Update - The info so far is awesome. Does it make sense to break the problem into two problems? One is generating the possibilities while the the other is finding the relationship between the possibilities(no more than a 10% daily change). By ruling out possibilities, couldn't that also be used to help only generate possibilities that are possible to begin with? I'm not sure the approach still but I do feel both problems are different but tightly related. Your thoughts?

Update 2 - there are a lot of questions around the % change. This is the total volume of items in the basket that can change. To use the game example, Imagine the store says - you can sell/return/buy fruits but they cannot be more than 10% of your last bill. So although the change in fruit prices can cause changes in your basket value, the user cannot take any action that would impact it by more than 10%. So if the value was 100, they can make changes that create get it to 110 but not more.

6 Answers

Answers 1

I hate to let you down but I really don't think a neural net will help at all for this problem, and IMO the best answer to your question is the advice "don't waste your time trying neural nets".

An easy rule of thumb for deciding whether or not neural networks are applicable is to think, "can an average adult human solve this problem reasonably well in a few seconds?" For problems like "what's in this image", "respond to this question", or "transcribe this audio clip", the answer is yes. But for your problem, the answer is a most definite no.

Neural networks have limitations, and one is that they don't deal well with highly logical problems. This is because the answers are generally not "smooth". If you take an image and slightly change a handful of pixels, the content of the image is still the same. If you take an audio clip and insert a few milliseconds of noise, a neural net will probably still be able to figure out what's said. But in your problem, change a single day's "total basket value" by only 1 unit, and your answer(s) will drastically change.

It seems that the only way to solve your problem is with a "classical" algorithmic approach. As currently stated, there might not be any algorithm better than brute force, and it might not be possible to rule out much. For example, what if every day has the property that all fruits are priced the same? The count of each fruit can vary, as long as the total number of fruits is fixed, so the number of possibilities is still exponential in the number of fruits. If your goal is to "produce a list of possibilities", then no algorithm can be better than exponential time since this list can be exponentially large in some cases.

It's interesting that part of your problem can be reduced to an integer linear program (ILP). Consider a single day, where you are given the basket total B and each fruit's cost c_i, for i=1 through i=n (if n is the total number of distinct fruits). Let's say the prices are large, so it's not obvious that you can "fill up" the basket with unit cost fruits. It can be hard in this situation to even find a single solution. Formulated as an ILP, this is equivalent to finding integer values of x_i such that:

sum_i (x_i*c_i) = x_1*c_1 + x_2*c_2 + ... + x_n*c_n = B 

and x_i >= 0 for all 1 <= i <= n (can't have negative fruits), and sum_i x_i <= 100 (can have at most 100 fruits).

The good news is that decent ILP solvers exist -- you can just hand over the above formulas and the solver will do its best to find a single solution. You can even add an "objective function" that the solver will maximize or minimize -- minimizing sum_i x_i has the effect of minimizing the total number of fruits in the basket. The bad news is that ILP is NP-complete, so there is almost no hope of finding an efficient solution for a large number of fruits (which equals the number of variables x_i).

I think the best approach forward is to try the ILP approach, but also introduce some more constraints on the scenario. For example, what if all fruits had a different prime number cost? This has the nice property that if you find one solution, you can enumerate a bunch of other related solutions. If an apple costs m and an orange costs n, where m and n are relatively prime, then you can "trade" n*x apples for m*x oranges without changing the basket total, for any integer x>0 (so long as you have enough apples and oranges to begin with). If you choose all fruits to have different prime number costs, then all of the costs will be pairwise relatively prime. I think this approach will result in relatively few solutions for a given day.

You might also consider other constraints, such as "there can't be more than 5 fruits of a single kind in the basket" (add the constraint x_i <= 5), or "there can be at most 5 distinct kinds of fruits in the basket" (but this is harder to encode as an ILP constraint). Adding these kinds of constraints will make it easier for the ILP solver to find a solution.

Of course the above discussion is focused on a single day, and you have multiple days' worth of data. If the hardest part of the problem is finding any solution for any day at all (which happens if your prices are large), then using an ILP solver will give you a large boost. If solutions are easy to find (which happens if you have a very-low-cost fruit that can "fill up" your basket), and the hardest part of the problem is finding solutions that are "consistent" across multiple days, then the ILP approach might not be the best fit, and in general this problem seems much more difficult to reason about.

Edit: and as mentioned in the comments, for some interpretations of the "10% change" constraint, you can even encode the entire multi-day problem as an ILP.

Answers 2

It seems to me like your approach is reasonable, but whether it is depends on the size of the numbers in the actual game. Here's a complete implementation that's a lot more efficient than yours (but still has plenty of scope for improvement). It keeps a list of possibilities for the previous day, and then filters the current day amounts to those that are within 5% of some possibility from the previous day, and prints them out per day.

def possibilities(value, prices):     for i in range(0, value+1, prices[0]):         for j in range(0, value+1-i, prices[1]):             k = value - i - j             if k % prices[2] == 0:                 yield i//prices[0], j//prices[1], k//prices[2]  def merge_totals(last, this, r):     ok = []     for t in this:         for l in last:             f = int(sum(l) * r)             if all(l[i] -f <= t[i] <= l[i] + f for i in range(len(l))):                 ok.append(t)                 break     return ok  days = [     (26, (1, 2, 3)),     (51, (2, 3, 4)),     (61, (2, 4, 5)), ]  ps = None for i, d in enumerate(days):     new_ps = list(possibilities(*d))     if ps is None:         ps = new_ps     ps = merge_totals(ps, new_ps, 0.05)      print('Day %d' % (i+1))     for p in ps:         print('apples: %s, pears: %s, oranges: %s' % p)     print 

Answers 3

Not an answer, but an attempt to make the one information about what "% change" might be supposed to mean (sum of change in count of each item computed backwards) more accessible to non-believers in pixel heaps:

        | Day 1        ! Day 2   change      ! Day 3   change      ! Day 4   change        |$/1|  # |  $  !$/1|  # |   %  |  $  !$/1|  # |   %  |  $  !$/1|  # |   %  |  $ Apples | 1 | 20 |  20 ! 2 | 21 | 4.76 |  42 ! 1 | 21 |  0   |  21 ! 1 | 22 | 4.55 |  22 Pears  | 2 | 43 |  86 ! 3 | 42 | 2.38 | 126 ! 2 | 43 | 2.33 |  86 ! 2 | 43 |  0   |  86 Oranges| 3 | 37 | 111 ! 5 | 36 | 2.78 | 180 ! 4 | 36 |  0   | 144 ! 3 | 35 | 2.86 | 105 Total  |    100 | 217 !    100 | 9.92 | 348 !    100 | 2.33 | 251 !    100 | 7.40 | 213 

Answers 4

Problem Framing

This problem can be described as a combinatorial optimization problem. You're trying to find an optimal object (a combination of fruit items) from a finite set of objects (all possible combinations of fruit items). With the proper analogy and transformations, we can reduce this fruit basket problem to the well known, and extensively studied (since 1897), knapsack problem.

Solving this class of optimization problems is NP-hard. The decision problem of answering "Can we find a combination of fruit items with a value of X?" is NP-complete. Since you want to account for a worst case scenario when you have thousands of fruit items, your best bet is to use a metaheuristic, like evolutionary computation.

Proposed Solution

Evolutionary computation is a family of biologically inspired metaheuristics. They work by revising and mixing (evolving) the most fit candidate solutions based on a fitness function and discarding the least fit ones over many iterations. The higher the fitness of a solution, the more likely it will reproduce similar solutions and survive to the next generation (iteration). Eventually, a local or global optimal solution is found.

These methods provide a needed compromise when the search space is too large to cover with traditional closed form mathematical solutions. Due to the stochastic nature of these algorithms, different executions of the algorithms may lead to different local optima, and there is no guarantee that the global optimum will be found. The odds are good in our case since we have multiple valid solutions.

Example

Let's use the Distributed Evolutionary Algorithms in Python (DEAP) framework and retrofit their Knapsack problem example to our problem. In the code below we apply strong penalty for baskets with 100+ items. This will severely reduce their fitness and have them taken out of the population pool in one or two generations. There are other ways to handle constraints that are also valid.

#    This file is part of DEAP. # #    DEAP is free software: you can redistribute it and/or modify #    it under the terms of the GNU Lesser General Public License as #    published by the Free Software Foundation, either version 3 of #    the License, or (at your option) any later version. # #    DEAP is distributed in the hope that it will be useful, #    but WITHOUT ANY WARRANTY; without even the implied warranty of #    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the #    GNU Lesser General Public License for more details. # #    You should have received a copy of the GNU Lesser General Public #    License along with DEAP. If not, see <http://www.gnu.org/licenses/>.  import random  import numpy as np  from deap import algorithms from deap import base from deap import creator from deap import tools  IND_INIT_SIZE = 5 # Calls to `individual` function MAX_ITEM = 100   # Max 100 fruit items in basket NBR_ITEMS = 50   # Start with 50 items in basket FRUIT_TYPES = 10 # Number of fruit types (apples, bananas, ...)  # Generate a dictionary of random fruit prices. fruit_price = {i: random.randint(1, 5)  for i in range(FRUIT_TYPES)}  # Create fruit items dictionary.  The key is item ID, and the  # value is a (weight, price) tuple.  Weight is always 1 here. items = {} # Create random items and store them in the items' dictionary. for i in range(NBR_ITEMS):     items[i] = (1, fruit_price[i])  # Create fitness function and an individual (solution candidate) # A solution candidate in our case is a collection of fruit items. creator.create("Fitness", base.Fitness, weights=(-1.0, 1.0)) creator.create("Individual", set, fitness=creator.Fitness)  toolbox = base.Toolbox()  # Randomly initialize the population (a set of candidate solutions) toolbox.register("attr_item", random.randrange, NBR_ITEMS) toolbox.register("individual", tools.initRepeat, creator.Individual,  toolbox.attr_item, IND_INIT_SIZE)   def evalBasket(individual):     """Evaluate the value of the basket and     apply constraints penalty.     """     value = 0 # Total value of the basket     for item in individual:         value += items[item][1]      # Heavily penalize baskets with 100+ items     if len(individual) > MAX_ITEM:         return 10000, 0     return len(individual), value  # (items in basket, value of basket)   def cxSet(ind1, ind2):     """Apply a crossover operation on input sets.     The first child is the intersection of the two sets,     the second child is the difference of the two sets.     This is one way to evolve new candidate solutions from     existing ones.  Think of it as parents mixing their genes     to produce a child.     """     temp = set(ind1)                # Used in order to keep type     ind1 &= ind2                    # Intersection (inplace)     ind2 ^= temp                    # Symmetric Difference (inplace)     return ind1, ind2   def mutSet(individual):     """Mutation that pops or add an element.     In nature, gene mutations help offspring express new traits     not found in their ancestors.  That could be beneficial or      harmful.  Survival of the fittest at play here.     """     if random.random() < 0.5:  # 50% chance of mutation         if len(individual) > 0:             individual.remove(random.choice(sorted(tuple(individual))))     else:         individual.add(random.randrange(NBR_ITEMS))     return individual,  # Register evaluation, mating, mutation and selection functions # so the framework can use them to run the simulation. toolbox.register("evaluate", evalKnapsack) toolbox.register("mate", cxSet) toolbox.register("mutate", mutSet) toolbox.register("select", tools.selNSGA2)   def main():     random.seed(64)     NGEN = 50     MU = 50     LAMBDA = 100     CXPB = 0.7     MUTPB = 0.2      pop = toolbox.population(n=MU)  # Initial population size     hof = tools.ParetoFront()    # Using Pareto front to rank fitness      # Keep track of population fitness stats which should      # improve over generations (iterations).     stats = tools.Statistics(lambda ind: ind.fitness.values)     stats.register("avg", numpy.mean, axis=0)     stats.register("std", numpy.std, axis=0)     stats.register("min", numpy.min, axis=0)     stats.register("max", numpy.max, axis=0)      algorithms.eaMuPlusLambda(pop, toolbox, MU,LAMBDA,\                               CXPB, MUTPB, NGEN, stats,\                               halloffame=hof)     return pop, stats, hof   if __name__ == "__main__":     main()    

Answers 5

You got a logic problem on integers, not a representation problem. Neural networks are relevant to problem with complex representation (eg., image with pixels, objects in differents shape and color, sometimes hidden etc), as they build their own set of features (descriptors) and mipmaps; they also are a good match with problems dealing with reals, not integer; and last, as they are today, they don't really deal with reasonning and logic, or eventually with simple logic like a small succession of if/else or switch but we don't really have a control over that.

What I see is closer to a cryptographic-ish problem with constraints (10% change, max 100 articles).

Solution for all sets of fruits

There is a way to reach all solutions very quickly. We start by factoring into primes the total, then we find few solutions through brute force. From there we can change the set of fruits with equal total. Eg., we exchange 1 orange for 1 apple and 1 pear with prices = (1,2,3). This way we can navigate through solutions without having to go through brute force.

Algorithm(s): you factorize in prime numbers the total, then you split them into two or more groups; let's take 2 groups: let A be one common multiplier, and let B the other(s). Then you can add your fruits to reach the total B.

Examples:

Day 1: Apple = 1, Pears = 2, Oranges = 3, Basket Value = 217

Day 2: Apple = 2, Pears = 3, Oranges = 5, Basket Value = 348

  • 217 factorizes into [7, 31], we pick 31 as A (common multiplier), then let say 7=3*2+1 (2 orange, 0 pear, 1 apple), you got an answer: 62 oranges, 0 pears, 31 apples. 62+31<100: valid.
  • 348 factorizes into [2, 2, 3, 29], you have several ways to group your factors and multiply your fruits inside this. The multiplier can be 29, (or 2*29 etc), then you pick your fruits to reach 12. Let's say 12=2*2+3+5. You got (2 apples, 1 pear, 1 orange) * 29, but it's more than 100 articles. You can fuse recursively 1 apple and 1 pear into 1 orange until you are below 100 articles, or you can go directly with the solution with a minimum of articles: (2 oranges, 1 apple)*29 = (58 oranges, 29 apples). And at last:

    -- 87<100 valid;

    -- the change is (-4 oranges, -2 apples), 6/93=6.45% <10% change: valid.

Code

Remark: no implementation of the 10% variation

Remark: I didn't implement the "fruit exchange" process that allows the "solution navigation"

Run with python -O solution.py to optimize and remove the debug messages.

def prime_factors(n):     i = 2     factors = []     while i * i <= n:         if n % i:             i += 1         else:             n //= i             factors.append(i)     if n > 1:         factors.append(n)     return factors  def possibilities(value, prices):     for i in range(0, value + 1, prices[0]):         for j in range(0, value + 1-i, prices[1]):             k = value - i - j             if k % prices[2] == 0:                 yield i//prices[0], j//prices[1], k//prices[2]  days = [     (217, (1, 2, 3)),     (348, (2, 3, 5)),     (251, (1, 2, 4)),     (213, (1, 2, 3)), ]  for set in days:     total = set[0]     (priceApple, pricePear, priceOrange) = set[1]      factors = prime_factors(total)     if __debug__:         print(str(total) + " -> " + str(factors))      # remove small article to help factorize (odd helper)     evenHelper = False     if len(factors) == 1 :         evenHelper = True         t1 = total - priceApple         factors = prime_factors(t1)         if __debug__:             print(str(total) + " --> " + str(factors))      # merge factors on left     while factors[0] < priceOrange :         factors = [factors[0] * factors[1]] + factors[2:]         if __debug__:             print("merging: " + str(factors))      # merge factors on right     if len(factors) > 2:         multiplier = 1         for f in factors[1:]:             multiplier *= f         factors = [factors[0]] + [multiplier]      (smallTotal, multiplier) = factors     if __debug__:         print("final factors: " + str(smallTotal) + " (small total) , " + str(multiplier) + " (multiplier)")      # solutions satisfying #<100     smallMax = 100 / multiplier     solutions = [o for o in possibilities(smallTotal, set[1]) if sum(o) < smallMax ]     for solution in solutions:         (a,p,o) = [i * multiplier for i in solution]          # if we used it, we need to add back the odd helper to reach the actual solution         if evenHelper:             a += 1          print(str(a) + " apple(s), " + str(p) + " pear(s), " + str(o) + " orange(s)")       # separating solutions     print() 

I timed the program with a 10037 total with (5, 8, 17) prices, and maximum 500 articles: it's about 2ms (on i7 6700k). The "solution navigation" process is very simple and shouldn't add significant time.

There might be a heuristic to go from day to day without having to do the factorization + navigation + validation process. I'll think about it.

Answers 6

Integer Linear Programming Approach

This sets up naturally as a multi-step Integer Program, with the holdings in {apples, pears, oranges} from the previous step in the calculation of the relative change in holdings that must be constrained. There is no notion of optimal here, but we can turn the "turnover" constraint into an objective and see what happens.

We in fact do get feasible integer solutions at each step, requiring smaller % changes to the basket than in your chart.

Comments -

  • I don't know how you calculated the "% change" column in your table. A change from Day 1 to Day 2 of 20 apples to 21 apples is a 4.76% change?

  • On all days, your total holdings in fruits is exactly 100. There is a constraint that the sum of holdings is <= 100. No violation, I just want to confirm.

We can set this up as an integer linear program, using the integer optimization routine from ortools. I haven't used an ILP solver for a long time, and this one is kind of flaky I think (the solver.OPTIMAL flag is never true it seems, even for toy problems. In addition the ortools LP solver fails to find an optimal solution in cases where scipy.linprog works without a hitch)

h1,d = holdings in apples (number of apples) at end of day d h2,d = holdings in pears at end of day d h3,d = holdings in oranges at end of day d 

I'll give two proposals here, one which minimizes the l1 norm of the absolute error, the other the l0norm.

The l1 solution finds the minimum of abs(h1,(d+1) - h1,d)/h1 + ... + abs(h3,(d+1) - h3,d)/h3), hoping that the constraint that each relative change in holdings is under 10% if the sum of the relative change in holdings is minimized.

The only thing that prevents this from being a linear program (aside from the integer requirement) is the nonlinear objective function. No problem, we can introduce slack variables and make everything linear. For the l1 formulation, 6 additional slack variables are introduced, 2 per fruit, and 6 additional inequality constraints. For the l0 formulation, 1 slack variable is introduced, and 6 additional inequality constraints.

This is a two step process, for example, replacing |apples_new - apples_old|/|apples_old| with the variable |e|, and adding inequality constraints to ensure the e measures what we'd like. We then replace|e| with (e+ - e-), each of e+, e- >0. It can be shown that one of e+, e- will be 0, and that (e+ + e-) is the absolute value of e. That way the pair (e+, e-) can represent a positive or negative number. Standard stuff, but that adds a bunch of variables and constraints. I can explain this in a bit more detail if necessary.

import numpy as np from ortools.linear_solver import pywraplp   def fruit_basket_l1_ortools():      UPPER_BOUND = 1000      prices = [[2,3,5],               [1,2,4],               [1,2,3]]      holdings = [20,43,37]      values = [348, 251, 213]      for day in range(len(values)):          solver = pywraplp.Solver('ILPSolver',                                   pywraplp.Solver.BOP_INTEGER_PROGRAMMING)         # solver = pywraplp.Solver('ILPSolver',         #                          pywraplp.Solver.CLP_LINEAR_PROGRAMMING)           c = ([1,1] * 3) + [0,0,0]          price = prices[day]         value = values[day]          A_eq = [[ 0,  0, 0,  0, 0,  0, price[0], price[1], price[2]]]          b_eq = [value]          A_ub = [[-1*holdings[0], 1*holdings[0],              0,             0,              0,              0,  1.0,    0,    0],                 [-1*holdings[0], 1*holdings[0],              0,             0,              0,              0, -1.0,    0,    0],                 [             0,             0, -1*holdings[1], 1*holdings[1],              0,              0,    0,  1.0,    0],                 [             0,             0, -1*holdings[1], 1*holdings[1],              0,              0,    0, -1.0,    0],                 [             0,             0,              0,              0, -1*holdings[2], 1*holdings[2],    0,    0,  1.0],                 [             0,             0,              0,              0, -1*holdings[2], 1*holdings[2],    0,    0, -1.0]]          b_ub = [1*holdings[0], -1*holdings[0], 1*holdings[1], -1*holdings[1], 1*holdings[2], -1*holdings[2]]          num_vars             = len(c)         num_ineq_constraints = len(A_ub)         num_eq_constraints   = len(A_eq)                          data = [[]] * num_vars          data[0]  = solver.IntVar(           0, UPPER_BOUND, 'e1_p')         data[1]  = solver.IntVar(           0, UPPER_BOUND, 'e1_n')         data[2]  = solver.IntVar(           0, UPPER_BOUND, 'e2_p')         data[3]  = solver.IntVar(           0, UPPER_BOUND, 'e2_n')         data[4]  = solver.IntVar(           0, UPPER_BOUND, 'e3_p')         data[5]  = solver.IntVar(           0, UPPER_BOUND, 'e3_n')         data[6]  = solver.IntVar(           0, UPPER_BOUND, 'x1')         data[7]  = solver.IntVar(           0, UPPER_BOUND, 'x2')         data[8]  = solver.IntVar(           0, UPPER_BOUND, 'x3')          constraints = [0] * (len(A_ub) + len(b_eq))          # Inequality constraints         for i in range(0,num_ineq_constraints):             constraints[i] = solver.Constraint(-solver.infinity(), b_ub[i])             for j in range(0,num_vars):                 constraints[i].SetCoefficient(data[j], A_ub[i][j])          # Equality constraints         for i in range(num_ineq_constraints, num_ineq_constraints+num_eq_constraints):             constraints[i] = solver.Constraint(b_eq[i-num_ineq_constraints], b_eq[i-num_ineq_constraints])             for j in range(0,num_vars):                 constraints[i].SetCoefficient(data[j], A_eq[i-num_ineq_constraints][j])          # Objective function         objective = solver.Objective()         for i in range(0,num_vars):             objective.SetCoefficient(data[i], c[i])          # Set up as minization problem         objective.SetMinimization()          # Solve it         result_status = solver.Solve()          solution_set = [data[i].solution_value() for i in range(len(data))]          print('DAY: {}'.format(day+1))         print('======')         print('SOLUTION FEASIBLE: {}'.format(solver.FEASIBLE))         print('SOLUTION OPTIMAL:  {}'.format(solver.OPTIMAL))         print('VALUE OF BASKET:   {}'.format(np.dot(A_eq[0], solution_set)))         print('SOLUTION   (apples,pears,oranges): {!r}'.format(solution_set[-3:]))         print('PCT CHANGE (apples,pears,oranges): {!r}\n\n'.format([round(100*(x-y)/y,2) for x,y in zip(solution_set[-3:], holdings)]))          # Update holdings for the next day         holdings = solution_set[-3:] 

A single run gives:

DAY: 1 ====== SOLUTION FEASIBLE: 1 SOLUTION OPTIMAL:  0 VALUE OF BASKET:   348.0 SOLUTION   (apples,pears,oranges): [20.0, 41.0, 37.0] PCT CHANGE (apples,pears,oranges): [0.0, -4.65, 0.0]   DAY: 2 ====== SOLUTION FEASIBLE: 1 SOLUTION OPTIMAL:  0 VALUE OF BASKET:   251.0 SOLUTION   (apples,pears,oranges): [21.0, 41.0, 37.0] PCT CHANGE (apples,pears,oranges): [5.0, 0.0, 0.0]   DAY: 3 ====== SOLUTION FEASIBLE: 1 SOLUTION OPTIMAL:  0 VALUE OF BASKET:   213.0 SOLUTION   (apples,pears,oranges): [20.0, 41.0, 37.0] PCT CHANGE (apples,pears,oranges): [-4.76, 0.0, 0.0] 

The l0 formulation is also presented:

def fruit_basket_l0_ortools():      UPPER_BOUND = 1000      prices = [[2,3,5],               [1,2,4],               [1,2,3]]      holdings = [20,43,37]      values = [348, 251, 213]      for day in range(len(values)):          solver = pywraplp.Solver('ILPSolver',                                   pywraplp.Solver.BOP_INTEGER_PROGRAMMING)         # solver = pywraplp.Solver('ILPSolver',         #                          pywraplp.Solver.CLP_LINEAR_PROGRAMMING)          c = [1, 0, 0, 0]          price = prices[day]         value = values[day]          A_eq = [[0, price[0], price[1], price[2]]]          b_eq = [value]          A_ub = [[-1*holdings[0],  1.0,    0,    0],                 [-1*holdings[0], -1.0,    0,    0],                 [-1*holdings[1],    0,  1.0,    0],                 [-1*holdings[1],    0, -1.0,    0],                 [-1*holdings[2],    0,    0,  1.0],                 [-1*holdings[2],    0,    0, -1.0]]           b_ub = [holdings[0], -1*holdings[0], holdings[1], -1*holdings[1], holdings[2], -1*holdings[2]]           num_vars             = len(c)         num_ineq_constraints = len(A_ub)         num_eq_constraints   = len(A_eq)                          data = [[]] * num_vars          data[0]  = solver.IntVar(-UPPER_BOUND, UPPER_BOUND, 'e' )         data[1]  = solver.IntVar(           0, UPPER_BOUND, 'x1')         data[2]  = solver.IntVar(           0, UPPER_BOUND, 'x2')         data[3]  = solver.IntVar(           0, UPPER_BOUND, 'x3')          constraints = [0] * (len(A_ub) + len(b_eq))          # Inequality constraints         for i in range(0,num_ineq_constraints):             constraints[i] = solver.Constraint(-solver.infinity(), b_ub[i])             for j in range(0,num_vars):                 constraints[i].SetCoefficient(data[j], A_ub[i][j])          # Equality constraints         for i in range(num_ineq_constraints, num_ineq_constraints+num_eq_constraints):             constraints[i] = solver.Constraint(int(b_eq[i-num_ineq_constraints]), b_eq[i-num_ineq_constraints])             for j in range(0,num_vars):                 constraints[i].SetCoefficient(data[j], A_eq[i-num_ineq_constraints][j])          # Objective function         objective = solver.Objective()         for i in range(0,num_vars):             objective.SetCoefficient(data[i], c[i])          # Set up as minization problem         objective.SetMinimization()          # Solve it         result_status = solver.Solve()          solution_set = [data[i].solution_value() for i in range(len(data))]          print('DAY: {}'.format(day+1))         print('======')         print('SOLUTION FEASIBLE: {}'.format(solver.FEASIBLE))         print('SOLUTION OPTIMAL:  {}'.format(solver.OPTIMAL))         print('VALUE OF BASKET:   {}'.format(np.dot(A_eq[0], solution_set)))         print('SOLUTION   (apples,pears,oranges): {!r}'.format(solution_set[-3:]))         print('PCT CHANGE (apples,pears,oranges): {!r}\n\n'.format([round(100*(x-y)/y,2) for x,y in zip(solution_set[-3:], holdings)]))          # Update holdings for the next day         holdings = solution_set[-3:] 

A single run of this gives

DAY: 1 ====== SOLUTION FEASIBLE: 1 SOLUTION OPTIMAL:  0 VALUE OF BASKET:   348.0 SOLUTION   (apples,pears,oranges): [33.0, 79.0, 9.0] PCT CHANGE (apples,pears,oranges): [65.0, 83.72, -75.68]   DAY: 2 ====== SOLUTION FEASIBLE: 1 SOLUTION OPTIMAL:  0 VALUE OF BASKET:   251.0 SOLUTION   (apples,pears,oranges): [49.0, 83.0, 9.0] PCT CHANGE (apples,pears,oranges): [48.48, 5.06, 0.0]   DAY: 3 ====== SOLUTION FEASIBLE: 1 SOLUTION OPTIMAL:  0 VALUE OF BASKET:   213.0 SOLUTION   (apples,pears,oranges): [51.0, 63.0, 12.0] PCT CHANGE (apples,pears,oranges): [4.08, -24.1, 33.33] 

Summary

The l1 formulation gives more sensible results, lower turnover, much lower. The optimality check fails on all runs, however, which is concerning. I included a linear solver too and that fails the feasiblity check somehow, I don't know why. The Google people provide precious little documentation for the ortools lib, and most of it is for the C++ lib. But the l1 formulation may be a solution to your problem, which may scale. ILP is in general NP-complete, and so is your problem most likely.

Also - does a solution exist on day 2? How do you define % change so that it does in your chart above? If I knew I could recast the inequalities above and we would have the general solution.

Read More

Friday, March 9, 2018

Hive Query Efficiency

Leave a Comment

Could you help me with a Hive Query Efficiency problem? I have two queries working for the same problem. I just cannot figure out why one is much faster than the other. If you know please feel free to provide insight. Any info is welcomed!

Problem: I am trying to check the minimum value of a bunch of variables in a Hive parquet table.

Queries: I tried two two queries as follows:

query 1

drop table if exists tb_1 purge; create table if not exists tb_1 as select 'v1' as name, min(v1) as min_value from src_tb union all select 'v2' as name, min(v2) as min_value from src_tb union all select 'v3' as name, min(v3) as min_value from src_tb union all ... select 'v200' as name, min(v200) as min_value from src_tb ; 

query 2

drop table if exists tb_2 purge; create table if not exists tb_2 as select min(v1) as min_v1 , min(v2) as min_v2 , min(v3) as min_v3 ... , min(v200) as min_v200 from src_tb ; 

Result: Query 2 is much faster than query 1. It took probably 5 mins to finish the second query. I don't know how long will query 1 take. But after I submit the first query, it took a long time to even react to the query, by which I mean that usually after I submit a query, the system will start to analyze and provides some compiling information in the terminal. However, for my first query, after my submission, the system won't even react to this. So I just killed it.

What do you think? Thank you in advance.

3 Answers

Answers 1

Query execution time depends on environment that you execute it.

In MSSQL.

Some people like you think query execution is similar to algorithm that they see in some theoretical resources, but in practical situation, it depends on other things.

For example both of your queries have SELECT statement that perform on a table and at first glance, they need to read all rows, but database server must analyze the statement to determine the most efficient way to extract the requested data. This is referred to as optimizing the SELECT statement. The component that does this is called the Query Optimizer. The input to the Query Optimizer consists of the query, the database schema (table and index definitions), and the database statistics. The output of the Query Optimizer is a query execution plan, sometimes referred to as a query plan or just a plan. (Please see this for more information about query-processing architecture)

You can see execution plan in MSSQL by reading this article and I think you will understand better by seeing execution plan for both of your queries.

Edit (Hive)

Hive provides an EXPLAIN command that shows the execution plan for a query. The syntax for this statement is as follows:

EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] query 

A Hive query gets converted into a sequence of stages. The description of the stages itself shows a sequence of operators with the metadata associated with the operators.

Please see LanguageManual Explain for more information.

Answers 2

What is surprising? The first query has to read src_tb a total of 200 times. The second reads the data once and performs 200 aggregations. It is a no brainer that it is faster.

Answers 3

As mentioned, source table is in parquet. Hive leverages columnar approach, but the latency caused while executing your query1 is due to scheduling delay.

Here's the breakdown.

Query 1 :

Since It is a Union All operation of 200 queries, when you run EXPLAIN on your query1, you should probably see roughly 202-205 stages in the execution plan.However resource manager will keep you waiting until it finds resources to schedule roughly about 200+ stages, in some cases application master may schedule/run multiple stages at once.Even though application master tries to run multiple stages in parallel, the scheduling delay is relatively higher compared to query2. This was the reason why you didn't see any response on your hive client when you submitted query 1. Coming columnar approach, In query1 it is definitely an optimization since you are selecting a single column and your source table is in parquet, but the scheduling delay out weighs in you case.

Query 2:

As opposed to query 1, when you run EXPLAIN on the query2 you may see 3 stages, if not fewer than 5 stages.However there is no scheduling delay in this case.This is the reason why you are seeing the results quick.Moving on to columnar format, in case you are selecting most of the columns in your source table, column/row orientation may not make any difference. Otherwise it is optimal as opposed to row oriented.

Read More