Below is my application database table which contains SQL queries stored in a table: QueryStorage
Id Query ConnectionString Rdbms 1 select... Data Source Sql Server 2 select... Data Source Oracle The SQL queries in the above table are updated through web service and we are not allowed to update above the queries, though we can add something on top of the query something like this:
Query stored in table: select id as LinkedColumn, Amount as CompareColumn from Source
Tweaked query from my c# app:select Q.LinkedColumn, Q.CompareColumn from (stored sql query) as Q
I am trying to compare 2 unordered list like below:
Query executed for Id = 1(Sql server) from QueryStorage table records are like below:
select Id as LinkedColumn,CompareColumn from Source List 1:
LinkedColumn CompareColumn 1 100 2 200 3 300 4 400 5 500 6 600 7 700 8 800 9 900 10 1000 Query executed for Id = 2(Oracle) from QueryStorage table records are like below:
select Id as LinkedColumn,CompareColumn from Target List 2:
LinkedColumn CompareColumn 10 10 9 20 8 30 7 40 6 50 5 60 4 70 3 80 2 90 1 5 I want to join on LinkedColumn from source to target and then do comparison on CompareColumn which should give me following output:
SrcLinkedColumn SrcCompareColumn TgtLinkedColumn TgtCompareColumn 1 100 1 5 2 200 2 90 Logic:
var data = (from s in List1.AsEnumerable() join t in List2.AsEnumerable() on s.Field<string>("LinkedColumn") equals t.Field<string>("LinkedColumn") where s.Field<decimal>("CompareColumn") != t.Field<decimal>("CompareColumn") select new { srcLinkedcol = s.Field<string>("LinkedColumn"), srcCompareCol = s.Field<decimal>("CompareColumn"), tgtLinkedCol = t.Field<string>("LinkedColumn"), tgtCompareCol = t.Field<decimal>("CompareColumn") }).ToList(); There will be millions of records from source to target which I want to compare with so big issue is of out of memory exception which we are facing right now by loading all data in memory and then doing comparison with above linq query.
There are 2 solutions which I have thought of like below:
1) Open 2 ordered data reader.
Pros :
- No memory exception - Fast as there will be 1 to 1 comparision of LinkedColumn for List1 and List2 records. Cons :
- Order by is require on LinkedColumns and as i have no control over query as because it is dumped by webservice in QueryStorage table so user is explicitly require to submit query with order by on LinkedColumn. - Wont work if order by on Linkedcolumn is not present. - Order by query have performance overhead so because of this user may not include order by on LinkedColumn in query. 2) Compare chunk by chunk records by tweaking query and adding OffSet and FetchRowNext. This is how I am thinking for algorithm:
But i still feel that with second approach i can get memory exception issue because at some steps where data from source and target are not matching i will be storing them inside buffer(datatable or list etc..) for next chunk comparision.
Can anybody please guide me what should be the good algorithm for this or any better way to address this?
Note : I dont want to use LinkedServer and SSIS.
7 Answers
Answers 1
24 million rows with two numbers each (8 bytes) is only ~200MB of memory. Two datasets are 400MB. This is nothing on the modern desktop hardware.
Load both datasets in memory. In simple arrays. Sort arrays in memory. Find the difference by scanning both sorted arrays once. You don't need fancy LINQ for that.
Here is pseudocode. We have Array1 and Array2. Maintain two variables that contain the current index of each array: Idx1 and Idx2. Move the indexes along the arrays looking for places where LinkedColumn match.
Idx1 = 0; Idx2 = 0; while (true) { // scan arrays until both indexes point to the same LinkedColumn // here we assume that both arrays are sorted by LinkedColumn // and that values in LinkedColumn are unique while (Idx1 < Array1.Length && Idx2 < Array2.Length && Array1[Idx1].LinkedColumn < Array2[Idx2].LinkedColumn) { // move along Array1 Idx1++; } while (Idx1 < Array1.Length && Idx2 < Array2.Length && Array1[Idx1].LinkedColumn > Array2[Idx2].LinkedColumn) { // move along Array2 Idx2++; } // at this point both indexes point to the same LinkedColumn // or one or both of the arrays are over if (Idx1 >= Array1.Length || Idx2 >= Array2.Length) { break; } // compare the values if (Array1[Idx1].CompareColumn != Array2[Idx2].CompareColumn) { // TODO: output/save/print the difference } Idx1++; Idx2++; } OR
you can dump both datasets in a database of your choice in two tables T1 and T2, create unique index on LinkedColumn in both tables and run this query:
SELECT T1.LinkedColumn AS SrcLinkedColumn ,T1.CompareColumn AS SrcCompareColumn ,T2.LinkedColumn AS DstLinkedColumn ,T2.CompareColumn AS DstCompareColumn FROM T1 INNER JOIN T2 ON T1.LinkedColumn = T2.LinkedColumn WHERE T1.CompareColumn <> T2.CompareColumn ORDER BY T1.LinkedColumn ; The pseudocode above performs the same merge join as the DBMS server would perform for this query.
Answers 2
If I understood correctly, you get tables data in C# by somethings like DataReader. If you store records that fetch from web service and then execute some query as @VladimirBaranov mentioned, it need too long time for storing and it not reasonable.
I think a better solution for comparing in this case is Binary Search. Time order of it is O(log n) and Memory order of it is O(1). So it doesn't need to extra memory.
You can use Binary search like below:
1- Sorting smaller table.
Time order: If this table has n element, on average, executing time is O(nlogn) (more information).
Memory order: O(1), because sort method in C# is In-place.
2- Comparing another table records with sorted table by Binary search.
Time order: If another table has m records, executing time is O(m)*O(logn) = O(mlogn). (more information)
Memory order: It needs no extra memory.
To conclusion, I thank may be best solution for your issue, but I think this solution is good solution that has O(mlogn) executing time, O(1) memory order and it implements only in C#, no need to save records in database.
Answers 3
You probably get a out of memory due to a memory fragmentation.
- Ensure that you use an efficient way to read your data like a DataReader.
- To avoid fragmentation, get the count of your records and initialize the List 1 array with this count before starting to insert in the array. If you cannot get the count you can increase the grow factor of your list or build a query that will contains the original query as a sub query like this: SELECT COUNT(*) FROM ( [YOUR ORIGINAL QUERY HERE] )
- Build an array of struct that contains your fields or a 2 dimensions array, this will not require a reference for each rows.
- Also, you can store the original (List 1) data in your destination format. This way you will only have one array in memory.
Here is the pseudo code:
- Get row count of List1:
- Create a 2 dimensions array to hold data of list1
- Get data from List1 and add those record to your array
- Dispose you first datareader
- Use an in place sort to sort list1 data (this way you will be able to use binary search to find list1 row).
- Get data from List2
- Foreach row of List2 use a binary search to find corresponding record in list1: Add compare value of List2 to the matching row (remember your array already contains your 3 columns). You only need to set the compare value of list2).
- Cleanup you result by removing rows that do not have match. I recommend you to cleanup using the same array. With two index you can scan you array and move up record in the array.
Answers 4
I'm sorry but I can only give you an idea :
Constraints :
- Target has millions of records
- Source has millions of records
- Memory is limited
The idea :
- Use at least 2 machines for this
- One as the [Source], and the other one as the [Target] which has the matching service
- Send every n (chunk) records from [Source] to [Target] machine by calling the service, let it match the records and then send back the result to [Source]
Bonus :
- Use more than 1 [Target] machines, and let the [Source] machine utilize them all in asynchronous and load balanced scenario
There are some similar (or advance) solutions you may find, like :
- [Target] and [Source] may utilize a shared memory (a cache server, for example) to store the result
- using a real load-balancer layer between them, to handle multiple requests to multiple [Target] machines
- using a queue/messaging layer
- even using multiple [Source] with multiple [Target] machines
I'm sorry but (at least for me) this case is too big too handle by using 1 machine only
I hope at least this could give you an idea
Answers 5
What about removing the elements that match? Something like:
List1.RemoveAll(x => List2.Any(y => y.LinkedColumn == x.LinkedColumn && y.CompareColumn == x.CompareColumn)) Answers 6
As it seems LinkedColumn contains unique values on each source, I would load both data in two tables in a local database, create an index on LinkedColumn and do a join.
Or you can load both data in a simple file, including a source identifier on each row as follows (don't include the header):
LinkedColumn Identifier CompareColumn 1 S 100 2 S 200 3 S 300 ... 4 O 70 3 O 80 2 O 90
S stands for SQL Server and O for Oracle. Sort the file (maybe you can run the OS sort or another external sort). Read the sorted file line by line. The first line must be stored so you'll compare its LinkedColumn with the second line for a match, thus collecting them or storing the second one and dropping the first to look for a match on the third and so on.
Hope it helps.
Answers 7
Just leave this job to SQL Server.
Let it process WHERE CLAUSE
T1.CompareColumn <> T2.CompareColumn

0 comments:
Post a Comment