Showing posts with label bigdata. Show all posts
Showing posts with label bigdata. Show all posts

Sunday, May 13, 2018

Map Reduce Job to find the popular items in a time window

Leave a Comment

I was asked this question in an interview, and I'm not sure if I gave the proper answer, so I would like some insights.

The problem: There is a stream of users and items. At each minute, I receive a list of tuples (user, item), representing that a user u consumed item i. I need to find the top 100 popular items in the past hour, i.e., calculate how many users consumed each item and sort them. The trick here is that in the past hour, if an item is consumed by the same user more than once, only 1 consumption is considered. No repeated consumption by the same user is allowed. The interviewer said that I should think big and there would be millions of consumptions per hour. So, he suggested me to do a map-reduce job or something that can deal with this large amount of data per minute.

The solution I came up with: I said that I could maintain a list (or a matrix if you prefer) of the consumed user-item-timestamp tuples, as if there was a time-window shifting. Something like:

  • u1,i1,t1
  • u1,i2,t1
  • u2,i2,t2... and so on.

At each minute, when I receive the stream of user-items consumption for this minute, I first make a map-reduce job to update the time-window matrix, with the current timestamp. This map-reduce job could be done by two mappers (one for the stream and the other for the time-window list), and the reducer would simply get the maximum for each pair. A pseudo-code for what I did:

mapTimeWindow(line):     user, item, timestamp = line.split(" ")     context.write(key=(user,item), value=timestamp)  mapStream(line):     user, item = line.split(" ")     context.write(key=(user,item), value=now())  reducer(key, list):     context.write(key=(user,item), value=max(list)) 

Next, I also do a map-reduce to calculate the popularity by calculating the times that each user appear in that list. My map reads for the updated time window list and write item and 1. The reducer calculates the sum of the list for each item. Since I am storing all the timestamp, I verify if the consumption is in the past hour or not. Another map-reduce pseudo-code:

mapPopularity(line):     user, item, timestamp = line.split(" ")     if now()-60>timestamp:          return     context.write(key=item, value=1) # no repetition  reducerPopularity(key, list):     context.write(key=item, value=sum(list)) 

Later we can do another map-reduce to read from the result of the second job and calculate the top100 largest items. Something done like this.

My question: is this solution acceptable for the interview I had? It contains three map-reduces to solve the problem. However, it seems to me to be quite a lot to execute at each minute. Since it needs to be updated at every minute, it cannot last longer than that. I mean, I put quite a lots of efforts into it, but the interviewer didn't give me a feedback if it is right or not. I would to know: is it possible to make it faster? Or is it possible to deal with this in another way? (maybe not map-reduce)

1 Answers

Answers 1

Telling if your solution is acceptable or not, is ultimately an opinion. The interviewer could appreciate your algorithm or perhaps your problem solving process and your thinking. Only your interviewer can ultimately tell. Your solution certaintly follows a logic and does the job, if the algorithm you wrote is implemented in a complete and correct way.

My solution:

As you explained, the main concern is performance, since we have big data, so we shall reduce space complexity, time complexity and number of executions by keeping it to the least amount necessary.

Space complexity

I would keep one list of [user,timestamp] per item (or more performant collection depending on the libraries you use but I will keep it base-case here. See dict note at the end). Every new item has its own list. This essentially is better than an overall [user, timestamp,item] because that is worse in memory usage due to the extra field and requiring an additional map operation or maybe just filtering because you have to process all associations existing to extract those "per item". More easily, you can get the list for that item "by hash" or by reference in the code. This model is the minimalistic one.

Time complexity

That said, there is the purge operation and the popularity extraction. Since we want to limit hits, but you must check timestamps every time you calculate current popularity due to specifics, you must scroll your list requiring complexity of O(n). Therefore: Filter by current time <60 the way you did. This will purge expired associations. Then simply len(list_of_that_item). Complexity O(1). Done.

Since the linear search cost is paid by the filtering, a reduce operation would pay a similar cost if you want to count the non expired entries without purging. If and only if deleting from the list has a bigger overhead, you may want to benchmark a non-deleting algorithm that keeps associations "forever" and you manually schedule purging operations. Although the previous solution should perform better, it is worth mentioning for completeness.

Insertion

If you use dicts it's trivial (and more performant too). Updating the timestamp or inserting if not present are the same code: strawberry["Mike"]=timestamp. Moreover the overall associations set is a dict with key=item and value=per_item_dict and per_item_dict has key=user value=timestamp. Therefore data[strawberry]["Mike"]=timestamp

Read More

Saturday, December 16, 2017

how to load json file greater than 10gb in pandas/python of a particular pattern

Leave a Comment

I have a json file of 11gb and I'm unable to load it in pandas. (Source: http://jmcauley.ucsd.edu/data/amazon/) Metadata within the above link is the file that I'm using.

Metadata: Metadata includes descriptions, price, sales-rank, brand info, and co-purchasing links:

It has the following pattern -

{ "asin": "0000031852", "title": "Girls Ballet Tutu Zebra Hot Pink", "price": 3.17, "imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg", "related": { "also_bought": ["B00JHONN1S", "B002BZX8Z6", "B00D2K1M3O", "0000031909", "B00613WDTQ", "B00D0WDS9A", "B00D0GCI8S", "0000031895", "B003AVKOP2", "B003AVEU6G", "B003IEDM9Q", "B002R0FA24", "B00D23MC6W", "B00D2K0PA0", "B00538F5OK", "B00CEV86I6", "B002R0FABA", "B00D10CLVW", "B003AVNY6I", "B002GZGI4E", "B001T9NUFS", "B002R0F7FE", "B00E1YRI4C", "B008UBQZKU", "B00D103F8U", "B007R2RM8W"], "also_viewed": ["B002BZX8Z6", "B00JHONN1S", "B008F0SU0Y", "B00D23MC6W", "B00AFDOPDA", "B00E1YRI4C", "B002GZGI4E", "B003AVKOP2", "B00D9C1WBM", "B00CEV8366", "B00CEUX0D8", "B0079ME3KU", "B00CEUWY8K", "B004FOEEHC", "0000031895", "B00BC4GY9Y", "B003XRKA7A", "B00K18LKX2", "B00EM7KAG6", "B00AMQ17JA", "B00D9C32NI", "B002C3Y6WG", "B00JLL4L5Y", "B003AVNY6I", "B008UBQZKU", "B00D0WDS9A", "B00613WDTQ", "B00538F5OK", "B005C4Y4F6", "B004LHZ1NY", "B00CPHX76U", "B00CEUWUZC", "B00IJVASUE", "B00GOR07RE", "B00J2GTM0W", "B00JHNSNSM", "B003IEDM9Q", "B00CYBU84G", "B008VV8NSQ", "B00CYBULSO", "B00I2UHSZA", "B005F50FXC", "B007LCQI3S", "B00DP68AVW", "B009RXWNSI", "B003AVEU6G", "B00HSOJB9M", "B00EHAGZNA", "B0046W9T8C", "B00E79VW6Q", "B00D10CLVW", "B00B0AVO54", "B00E95LC8Q", "B00GOR92SO", "B007ZN5Y56", "B00AL2569W", "B00B608000", "B008F0SMUC", "B00BFXLZ8M"], "bought_together": ["B002BZX8Z6"] }, "salesRank": {"Toys & Games": 211836}, "brand": "Coxlures", "categories": [["Sports & Outdoors", "Other Sports", "Dance"]] } 

On the other hand the 1st 10 rows have the following data:

["{'asin': '0001048791', 'salesRank': {'Books': 6334800}, 'imUrl': 'http://ecx.images-amazon.com/images/I/51MKP0T4DBL.jpg', 'categories': [['Books']], 'title': 'The Crucible: Performed by Stuart Pankin, Jerome Dempsey &amp; Cast'}\n", "{'asin': '0000143561', 'categories': [['Movies & TV', 'Movies']], 'description': '3Pack DVD set - Italian Classics, Parties and Holidays.', 'title': 'Everyday Italian (with Giada de Laurentiis), Volume 1 (3 Pack): Italian Classics, Parties, Holidays', 'price': 12.99, 'salesRank': {'Movies & TV': 376041}, 'imUrl': 'http://g-ecx.images-amazon.com/images/G/01/x-site/icons/no-img-sm._CB192198896_.gif', 'related': {'also_viewed': ['B0036FO6SI', 'B000KL8ODE', '000014357X', 'B0037718RC', 'B002I5GNVU', 'B000RBU4BM'], 'buy_after_viewing': ['B0036FO6SI', 'B000KL8ODE', '000014357X', 'B0037718RC']}}\n", "{'asin': '0000037214', 'related': {'also_viewed': ['B00JO8II76', 'B00DGN4R1Q', 'B00E1YRI4C']}, 'title': 'Purple Sequin Tiny Dancer Tutu Ballet Dance Fairy Princess Costume Accessory', 'price': 6.99, 'salesRank': {'Clothing': 1233557}, 'imUrl': 'http://ecx.images-amazon.com/images/I/31mCncNuAZL.jpg', 'brand': 'Big Dreams', 'categories': [['Clothing, Shoes & Jewelry', 'Girls'], ['Clothing, Shoes & Jewelry', 'Novelty, Costumes & More', 'Costumes & Accessories', 'More Accessories', 'Kids & Baby']]}\n", "{'asin': '0000032069', 'title': 'Adult Ballet Tutu Cheetah Pink', 'price': 7.89, 'imUrl': 'http://ecx.images-amazon.com/images/I/51EzU6quNML._SX342_.jpg', 'related': {'also_bought': ['0000032050', 'B00D0DJAEG', '0000032042', 'B00D0F450I', 'B00D2JTMS2', 'B00D0FDUAY', 'B00D2JSRFQ', '0000032034', 'B00D0D5F6S', 'B00D2JRWWA', 'B00D0FIIJM', 'B00D0FCQQI', 'B00EXVN9PU', 'B0041EOTJO', 'B004PYEE8G', 'B001GTKPDQ', 'B00EON0SJ2', 'B005HMHOQ4', 'B002XZMGGQ'], 'also_viewed': ['B00D0F450I', '0000032050', 'B00D2JTMS2', '0000032042', 'B004PYEE8G', 'B00JHNSNSM', 'B00D0DJAEG', 'B00D2JSRFQ', 'B00D0FCQQI', 'B00D2JRWWA', 'B003AVNY6I', 'B0071KR2LC', 'B00GOR07RE', 'B00D0FIIJM', 'B005F50FXC', 'B0079MCIMU', 'B00D0FDUAY', 'B00H3RYN3I', 'B005C4Y4F6', 'B007IEFT84', 'B00D0D5F6S', 'B002BZX8Z6', 'B00JHONN1S', 'B008F0SU0Y', 'B00FNNFXAG', 'B007R2RM8W', 'B007VM3AMK', 'B00C0PLENA', 'B00BJGG6VG', 'B00E1YRI4C', 'B00IIK61WA', 'B009UC638W', 'B00KZN6RVI', 'B00CSFEENY', 'B002GZGI4E', 'B00HSOJJ94', 'B00LIPP4VG', 'B009RXWNSI', 'B00E87F196', 'B005HMHOQY', 'B00J6S9MSS', '0000032034', 'B00CJQGNJK', 'B008FCA0F0', 'B0056LG7GY', 'B00DPQWCZ2', 'B00I3PV0US', 'B00KZN6IVW', 'B0054TBWKO', 'B00I2S01I8', 'B00BXF12P8', 'B00GVHU678', 'B005NWENGC', 'B003AVKOP2', 'B00JK8MQ4Q', 'B00FZIMVQS', 'B008BB19VE', 'B00GTEXPOE', 'B009WPT2RQ', 'B00E37SBBG'], 'bought_together': ['0000032050', 'B00D0DJAEG', '0000032042', 'B00D0F450I']}, 'brand': 'BubuBibi', 'categories': [['Sports & Outdoors', 'Other Sports', 'Dance', 'Clothing', 'Girls', 'Skirts']]}\n", "{'asin': '0000031909', 'related': {'also_bought': ['B002BZX8Z6', 'B00JHONN1S', '0000031895', 'B00D2K1M3O', '0000031852', 'B00D0WDS9A', 'B00D10CLVW', 'B00D103F8U', 'B003AVEU6G', 'B00D2K0PA0', 'B002GZGI4E', 'B00D0ZF44Y', 'B008F0SMUC', 'B00D0GCI8S', 'B008F0SU0Y', 'B002YSCPZY', '0448408775', 'B002R0FABA', 'B008GHWNWC', 'B002R0FA24', 'B001GTKPEK', 'B006XA7KZO', 'B001GZUQ9S', 'B00613VNL0', 'B003IEDM9Q', 'B003LTOZK8', 'B003AVNY6I', 'B008UBQZKU', 'B001AQD8VQ', 'B003ILA0L2', 'B00AFDOPDA', 'B002R0F7FE'], 'also_viewed': ['B002BZX8Z6', 'B00JHONN1S', 'B008F0SU0Y', 'B00E1YRI4C', 'B00AFDOPDA', 'B002GZGI4E', 'B00CEUWY8K', 'B003IEDM9Q', 'B00HSOJB9M', '0000031895', 'B003AVKOP2', 'B005C4Y4F6', 'B008F0SMUC', 'B00362QGW0', 'B008UD01L2', 'B00FAZ5ZE6', 'B008F0SY6O', 'B00DPLLQR2', 'B00CEUWUZC', 'B004PYEE8G', 'B003AVNY6I', 'B00CEUX0D8', 'B00JHNSNSM', 'B00D10CLVW', 'B00D23MC6W', 'B007XAI53E', 'B008X6CBS2', 'B004PEI45U', 'B005HMHOQ4', 'B002C3Y6WG', 'B00HSC8O74', 'B008BMGHM4', 'B00CEUWTFS', 'B00FNNFXAG', 'B00CYBU84G', 'B00D9C32NI', 'B0046W9T8C', 'B008UBG5IW', 'B001YHX45G', 'B00CEV8366', 'B00I2UHSZA', 'B009RXWNSI', 'B008FCA0F0', 'B001GTKPEK', 'B004TU1VPU', 'B00CBPIO7S', 'B00CHHXJ0M', 'B00538F5OK', 'B005F50FXC', 'B00CEUX4QQ', 'B003XRKA7A', '0000031852', 'B002C3R5XI', 'B00C6Q1Z6E'], 'bought_together': ['B002BZX8Z6']}, 'title': 'Girls Ballet Tutu Neon Pink', 'price': 7.0, 'salesRank': {'Toys & Games': 201847}, 'imUrl': 'http://ecx.images-amazon.com/images/I/41xBoP0FVzL._SY300_.jpg', 'brand': 'Unknown', 'categories': [['Sports & Outdoors', 'Other Sports', 'Dance']], 'description': 'High quality 3 layer ballet tutu. 12 inches in length'}\n", "{'asin': '0000032034', 'title': 'Adult Ballet Tutu Yellow', 'price': 7.87, 'imUrl': 'http://ecx.images-amazon.com/images/I/21GNUNIa1CL.jpg', 'related': {'also_bought': ['B00D2JSRFQ', '0000032042', '0000032050', 'B00D2JTMS2', 'B00D0FDUAY', 'B00D0FIIJM', 'B00D2JRWWA', 'B00D0F450I', 'B00D0FCQQI', 'B00H3RYN3I', 'B002I55DT8', 'B00498HUQ6', 'B001YZCF1M', 'B00FNNFXAG', 'B00EON0SJ2', 'B000J09OV2', 'B0048WRX5G', 'B00I2EOG92', 'B003UM99FC', 'B00D0DJAEG', '0000032069', 'B00I2S01I8', 'B003AVKOP2', 'B003CPDAUW', 'B005HMHOQ4', 'B00JHONN1S', 'B00GOR07RE', 'B007TMMVJA', 'B00DPPRW2G', 'B0089ND408', 'B0046W9T8C', 'B005HMHOQE', 'B00EOOR812', 'B00CLZWXYI', 'B008AU29UQ', 'B00BNRKT6E', 'B004YHFSIO', 'B00EB5WN9Q', 'B008UBQZKU', 'B00D0D5F6S', 'B004PYEE8G', 'B00FQU9ZUA', 'B008AABRPO', 'B007BZ5CUA', 'B00I5SCG7E', 'B0036LOTNO', 'B009WPT2SA', 'B009QVCTTY', 'B00KZN5J8U', 'B008B81LN8', 'B00E1YRI4C', 'B004SVOVSE', 'B002I4ZJ1Q', 'B005AZMN3C', 'B00BBQFGWO', 'B009QVQZ3K', 'B005C4Y4F6', 'B008CLA6HG', 'B0085D9V1S', 'B000M55BDY', 'B00FE9DIHO', 'B009QVWIQ8', 'B00LIPP114', 'B001YHX45G', 'B00BN6S8WC', 'B009MDB6FE', 'B007AK1KTS', 'B00J6LZ16M', 'B002FRPE9I', 'B002RHLXKU', 'B006F404KQ', 'B00362OQQI', 'B003UNHJ4Y', 'B00D10CLVW', 'B002BZX8Z6', 'B0041EOTJO', 'B00F3KZUPC', 'B0055A1F4A', 'B0035BGVYU', 'B000P18LZ0', 'B007F2H4PU', 'B004XHVUE6', 'B00KF54D6W', 'B0097B1D8G', 'B00840TWES', 'B0050GAHKC', 'B00I9JSUO2', 'B003HCYEQY', 'B0075CNY7M', 'B00AFDOPDA', 'B008FCA0F0', 'B000IRG356', 'B00DSVBR7S', 'B00DYIQ8E2', 'B0041BVA80', 'B009M7FWBE'], 'also_viewed': ['B00D2JSRFQ', '0000032050', 'B00JHNSNSM', '0000032042', 'B00D2JTMS2', 'B003AVKOP2', 'B004YHFSIO', 'B00GOR07RE', 'B00D0FDUAY', 'B004PYEE8G', 'B00D0FCQQI', 'B009WPT2SA', 'B003AVNY6I', 'B00EON0SJ2', 'B005C4Y4F6', 'B00D0FIIJM', 'B00FNNFXAG', 'B00D0F450I', 'B00D2JRWWA', 'B008F0SU0Y', 'B00JHONN1S', 'B00FE9DIHO', 'B0071KR2LC', 'B00H3RYN3I', 'B00IIK61WA', 'B00D0DJAEG', 'B00KZN6RVI', 'B0054TBWKO', 'B00GEDG8D0', 'B00JMX4CCS', 'B00K18LJ6U', 'B0079MCIMU', 'B005HMHOQY', 'B009RXWNSI', 'B007XAI4LW', 'B007IEFTO8', 'B00E1YRI4C', 'B007R2RM8W', 'B002BZX8Z6', 'B00IIK61UW', 'B008F0SMUC', 'B00KF54D6W', 'B00E1Q66BG', 'B003WFSLBA', 'B00IJVASUE', 'B00DPPRW2G', 'B00HAVJ48G', 'B002C3Y6WG', 'B00I5RLL4Y', 'B003AVEU6G', 'B00E95LC8Q', 'B005F50FXC', 'B002U03H1M', 'B00E87F196', 'B008A7TFK6', 'B00HSOJB9M', 'B008A7TFGK', 'B00DPYOB2Q', '0375851682', 'B00CSFEENY'], 'bought_together': ['B00D2JSRFQ', '0000032050', '0000032042', 'B00D2JTMS2']}, 'brand': 'BubuBibi', 'categories': [['Sports & Outdoors', 'Other Sports', 'Dance', 'Clothing', 'Girls', 'Skirts']]}\n", '{\'asin\': \'0000589012\', \'title\': "Why Don\'t They Just Quit? DVD Roundtable Discussion: What Families and Friends need to Know About Addiction and Recovery", \'price\': 15.95, \'imUrl\': \'http://ecx.images-amazon.com/images/I/519%2B1kseM3L._SY300_.jpg\', \'related\': {\'also_bought\': [\'B000Z3N1HQ\', \'0578045427\', \'B007VI5AQ8\', \'B003AC98V2\', \'B004V4RW8O\', \'B000I0QL7I\', \'B000J10F8C\', \'B0007CEXYK\', \'B000ERVK4Y\', \'B000XSKDBA\', \'B002UNMWTC\', \'B00008MTXI\', \'B007TSV4GK\', \'B0052ADP6Y\', \'B00EUENWIY\', \'B003YKYX9M\', \'B004RD3YFE\', \'B007Y9F6RW\', \'B00004UEDQ\', \'B0039Y774Q\', \'B0006IIKRG\', \'B00JAGF9HE\', \'6305162026\', \'6305692572\', \'B001D7T460\', \'B0018QOIWG\', \'B002Y7ZELW\', \'B0045HCJ08\', \'0830907394\', \'B000LAZDPG\', \'B00A2H9QN8\', \'B001O5CLXY\', \'B000JBXXYK\', \'B003B3NGS6\', \'B0037SR3N4\', \'B00641Y2ZS\', \'0470903953\', \'0977977315\', \'B00049QQHI\', \'B000E6ESU8\', \'0470402741\', \'061565732X\', \'0615763146\', \'B000VZPTH8\', \'B003JO6OPO\', \'B00787BTEO\', \'B004R1Q7YQ\', \'B001GG6GKK\', \'B0015VQAZM\', \'1592854869\', \'B000QRIL08\', \'B000GQLA8O\', \'B000MPM3TE\', \'0979021804\', \'1608823407\', \'159285821X\', \'B00005Q4CS\', \'B0000549B1\', \'6305594333\', \'B00AFEXRME\', \'B004FN25AG\', \'0830906363\', \'0470402768\', \'1118414756\', \'B009SV4O2M\', \'1481106694\', \'1572306254\', \'B0013MOLPO\', \'B00009Y3QI\', \'B003NMOL2U\', \'B001AKBI8C\', \'0981708803\', \'1572306394\', \'B00B9LNPA6\', \'B005BYBZEK\', \'B004D7SBMU\', \'B00CQMADIO\', \'0470405511\', \'B00CHEHHT4\', \'B000ESUWY2\', \'0792838068\', \'B00AWE09Z0\', \'B00E4XZZEK\', \'0830914870\', \'B00GFZLEF4\', \'083090459X\', \'1402218443\', \'1893007170\', \'1893277046\', \'B005CKI7H6\', \'B0001LQL6K\', \'B000067S10\', \'0890425558\', \'B00114KYC8\', \'1466221224\', \'0943158508\', \'B00A7ID5BG\', \'0671765582\', \'B000B8IH10\', \'1568381395\'], \'buy_after_viewing\': [\'B003AC98V2\', \'B007VI5AQ8\', \'B000ERVK4Y\', \'B0007CEXYK\']}, \'salesRank\': {\'Movies & TV\': 1084845}, \'categories\': [[\'Movies & TV\', \'Movies\']]}\n', "{'asin': '0001048775', 'description': 'William Shakespeare is widely regarded as the greatest playwright the world has seen. He produced an astonishing amount of work; 37 plays, 154 sonnets, and 5 poems. He died on 23rd April 1616, aged 52, and was buried in the Holy Trinity Church, Stratford.', 'title': 'Measure for Measure: Complete &amp; Unabridged', 'imUrl': 'http://ecx.images-amazon.com/images/I/5166EBHDQYL.jpg', 'salesRank': {'Books': 13243226}, 'categories': [['Books']]}\n", "{'asin': '0000031852', 'related': {'also_bought': ['B00JHONN1S', 'B002BZX8Z6', 'B00D2K1M3O', '0000031909', 'B00613WDTQ', 'B00D0WDS9A', 'B00D0GCI8S', '0000031895', 'B003AVKOP2', 'B003AVEU6G', 'B003IEDM9Q', 'B002R0FA24', 'B00D23MC6W', 'B00D2K0PA0', 'B00538F5OK', 'B00CEV86I6', 'B002R0FABA', 'B00D10CLVW', 'B003AVNY6I', 'B002GZGI4E', 'B001T9NUFS', 'B002R0F7FE', 'B00E1YRI4C', 'B008UBQZKU', 'B00D103F8U', 'B007R2RM8W'], 'also_viewed': ['B002BZX8Z6', 'B00JHONN1S', 'B008F0SU0Y', 'B00D23MC6W', 'B00AFDOPDA', 'B00E1YRI4C', 'B002GZGI4E', 'B003AVKOP2', 'B00D9C1WBM', 'B00CEV8366', 'B00CEUX0D8', 'B0079ME3KU', 'B00CEUWY8K', 'B004FOEEHC', '0000031895', 'B00BC4GY9Y', 'B003XRKA7A', 'B00K18LKX2', 'B00EM7KAG6', 'B00AMQ17JA', 'B00D9C32NI', 'B002C3Y6WG', 'B00JLL4L5Y', 'B003AVNY6I', 'B008UBQZKU', 'B00D0WDS9A', 'B00613WDTQ', 'B00538F5OK', 'B005C4Y4F6', 'B004LHZ1NY', 'B00CPHX76U', 'B00CEUWUZC', 'B00IJVASUE', 'B00GOR07RE', 'B00J2GTM0W', 'B00JHNSNSM', 'B003IEDM9Q', 'B00CYBU84G', 'B008VV8NSQ', 'B00CYBULSO', 'B00I2UHSZA', 'B005F50FXC', 'B007LCQI3S', 'B00DP68AVW', 'B009RXWNSI', 'B003AVEU6G', 'B00HSOJB9M', 'B00EHAGZNA', 'B0046W9T8C', 'B00E79VW6Q', 'B00D10CLVW', 'B00B0AVO54', 'B00E95LC8Q', 'B00GOR92SO', 'B007ZN5Y56', 'B00AL2569W', 'B00B608000', 'B008F0SMUC', 'B00BFXLZ8M'], 'bought_together': ['B002BZX8Z6']}, 'title': 'Girls Ballet Tutu Zebra Hot Pink', 'price': 3.17, 'salesRank': {'Toys & Games': 211836}, 'imUrl': 'http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg', 'brand': 'Coxlures', 'categories': [['Sports & Outdoors', 'Other Sports', 'Dance']], 'description': 'TUtu'}\n", '{\'asin\': \'0001048236\', \'categories\': [[\'Books\']], \'description\': "&#34;One thing is certain, Sherlockians, put aside your Baring-GouldAnnotated, your Folio SocietyIllustrated-for the time being, the Oxford is the edition to curl up with on a winter\'s night&#34;--The Chicago Tribune&#34;An incomparable gift book; or, should you find it impossible to surrender up such treasures, the best of gifts to oneself&#34;--USA Today&#34;To the true Sherlockian, this will be a treasure; to otherwise diverted detective story fans, it is a rich lode for discovery&#34;--Denver Post&#34;The complete and authentic adventures of the legendary detective--expertly edited and annotated by a team of Holmes scholars....in a handsome, boxed set....A lovely gift&#34;--The Christian Science Monitor&#34;Here in nine volumes...are all the adventures of Holmes and Watson. Each book has an introduction, something new and fascinating for even the most devoted Holmesians plus a series of intelligent notes at the back of each volume.&#34;--Oxford Times&#34;TheOxford Sherlock Holmes, a new edition of the stories, is a splendid piece of publishing. Nine compact volumes, beautifully produced, each with a stimulating introduction; clear type, accurate texts, a handy chronology, a helpful bibliography. And, most valuable of all, explanatory notes running to 50 pages or more per volume.&#34; --John Gross, writing inSunday Telegraph--This text refers to thePaperbackedition.", \'title\': \'The Sherlock Holmes Audio Collection\', \'price\': 9.26, \'salesRank\': {\'Books\': 8973864}, \'imUrl\': \'http://ecx.images-amazon.com/images/I/51DH145C5JL.jpg\', \'related\': {\'also_viewed\': [\'1442300191\', \'9626349786\', \'1602837155\', \'1598879162\', \'1400115159\', \'1478396202\', \'1408426250\', \'B007PM2A4A\', \'1609980603\'], \'buy_after_viewing\': [\'0312089457\']}}\n'] 

How can I write a python script to load the json file in pandas?

2 Answers

Answers 1

I didn't really understand the problem. Was it a memory consumption or a file format issue? What errors occurred while using pd.read_json() ?

If it's a file format issue (e.g. raised ValueError: Trailing data):

Based on the source link you posted, it seems that the file is NOT an ordinary json file (see "Reading the data" section). It's a line-delimited json file (each line is a json object) a widely used format for json streaming.

To read line-delimited json files, pass lines=True to pd.read_json():

pd.read_json('path/to/file',lines=True) 

Note that you need Pandas ver. 0.19.0+ to use this.

Answers 2

For your original data, you can directly import it into pandas DataFrame, once read into a dict:

json_data = ['{"asin": "0000031852", "categories": [["Sports & Outdoors", "Other Sports", "Dance"]], "title": "Girls Ballet Tutu Zebra Hot Pink", "price": 3.17, "salesRank": {"Toys & Games": 211836}, "imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg", "brand": "Coxlures", "related": {"also_bought": ["B00JHONN1S", "B002BZX8Z6", "B00D2K1M3O", "0000031909", "B00613WDTQ", "B00D0WDS9A", "B00D0GCI8S", "0000031895", "B003AVKOP2", "B003AVEU6G", "B003IEDM9Q", "B002R0FA24", "B00D23MC6W", "B00D2K0PA0", "B00538F5OK", "B00CEV86I6", "B002R0FABA", "B00D10CLVW", "B003AVNY6I", "B002GZGI4E", "B001T9NUFS", "B002R0F7FE", "B00E1YRI4C", "B008UBQZKU", "B00D103F8U", "B007R2RM8W"], "also_viewed": ["B002BZX8Z6", "B00JHONN1S", "B008F0SU0Y", "B00D23MC6W", "B00AFDOPDA", "B00E1YRI4C", "B002GZGI4E", "B003AVKOP2", "B00D9C1WBM", "B00CEV8366", "B00CEUX0D8", "B0079ME3KU", "B00CEUWY8K", "B004FOEEHC", "0000031895", "B00BC4GY9Y", "B003XRKA7A", "B00K18LKX2", "B00EM7KAG6", "B00AMQ17JA", "B00D9C32NI", "B002C3Y6WG", "B00JLL4L5Y", "B003AVNY6I", "B008UBQZKU", "B00D0WDS9A", "B00613WDTQ", "B00538F5OK", "B005C4Y4F6", "B004LHZ1NY", "B00CPHX76U", "B00CEUWUZC", "B00IJVASUE", "B00GOR07RE", "B00J2GTM0W", "B00JHNSNSM", "B003IEDM9Q", "B00CYBU84G", "B008VV8NSQ", "B00CYBULSO", "B00I2UHSZA", "B005F50FXC", "B007LCQI3S", "B00DP68AVW", "B009RXWNSI", "B003AVEU6G", "B00HSOJB9M", "B00EHAGZNA", "B0046W9T8C", "B00E79VW6Q", "B00D10CLVW", "B00B0AVO54", "B00E95LC8Q", "B00GOR92SO", "B007ZN5Y56", "B00AL2569W", "B00B608000", "B008F0SMUC", "B00BFXLZ8M"], "bought_together": ["B002BZX8Z6"]}}']  import json  data = [json.loads(d) for d in json_data]  # data now looks like below, as shared in post data = [{ "asin": "0000031852", "title": "Girls Ballet Tutu Zebra Hot Pink", "price": 3.17, "imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg", "related": { "also_bought": ["B00JHONN1S", "B002BZX8Z6", "B00D2K1M3O", "0000031909", "B00613WDTQ", "B00D0WDS9A", "B00D0GCI8S", "0000031895", "B003AVKOP2", "B003AVEU6G", "B003IEDM9Q", "B002R0FA24", "B00D23MC6W", "B00D2K0PA0", "B00538F5OK", "B00CEV86I6", "B002R0FABA", "B00D10CLVW", "B003AVNY6I", "B002GZGI4E", "B001T9NUFS", "B002R0F7FE", "B00E1YRI4C", "B008UBQZKU", "B00D103F8U", "B007R2RM8W"], "also_viewed": ["B002BZX8Z6", "B00JHONN1S", "B008F0SU0Y", "B00D23MC6W", "B00AFDOPDA", "B00E1YRI4C", "B002GZGI4E", "B003AVKOP2", "B00D9C1WBM", "B00CEV8366", "B00CEUX0D8", "B0079ME3KU", "B00CEUWY8K", "B004FOEEHC", "0000031895", "B00BC4GY9Y", "B003XRKA7A", "B00K18LKX2", "B00EM7KAG6", "B00AMQ17JA", "B00D9C32NI", "B002C3Y6WG", "B00JLL4L5Y", "B003AVNY6I", "B008UBQZKU", "B00D0WDS9A", "B00613WDTQ", "B00538F5OK", "B005C4Y4F6", "B004LHZ1NY", "B00CPHX76U", "B00CEUWUZC", "B00IJVASUE", "B00GOR07RE", "B00J2GTM0W", "B00JHNSNSM", "B003IEDM9Q", "B00CYBU84G", "B008VV8NSQ", "B00CYBULSO", "B00I2UHSZA", "B005F50FXC", "B007LCQI3S", "B00DP68AVW", "B009RXWNSI", "B003AVEU6G", "B00HSOJB9M", "B00EHAGZNA", "B0046W9T8C", "B00E79VW6Q", "B00D10CLVW", "B00B0AVO54", "B00E95LC8Q", "B00GOR92SO", "B007ZN5Y56", "B00AL2569W", "B00B608000", "B008F0SMUC", "B00BFXLZ8M"], "bought_together": ["B002BZX8Z6"] }, "salesRank": {"Toys & Games": 211836}, "brand": "Coxlures", "categories": [["Sports & Outdoors", "Other Sports", "Dance"]] }]  import pandas  pandas.DataFrame(data) 

However, the data that you have attached in your extended example is not Json, but a string representation of a dict. So, you will need to do a literal_eval to get the actual data:

data = ["{'asin': '0001048791', 'salesRank': {'Books': 6334800}, 'imUrl': 'http://ecx.images-amazon.com/images/I/51MKP0T4DBL.jpg', 'categories': [['Books']], 'title': 'The Crucible: Performed by Stuart Pankin, Jerome Dempsey &amp; Cast'}\n", "{'asin': '0000143561', 'categories': [['Movies & TV', 'Movies']], 'description': '3Pack DVD set - Italian Classics, Parties and Holidays.', 'title': 'Everyday Italian (with Giada de Laurentiis), Volume 1 (3 Pack): Italian Classics, Parties, Holidays', 'price': 12.99, 'salesRank': {'Movies & TV': 376041}, 'imUrl': 'http://g-ecx.images-amazon.com/images/G/01/x-site/icons/no-img-sm._CB192198896_.gif', 'related': {'also_viewed': ['B0036FO6SI', 'B000KL8ODE', '000014357X', 'B0037718RC', 'B002I5GNVU', 'B000RBU4BM'], 'buy_after_viewing': ['B0036FO6SI', 'B000KL8ODE', '000014357X', 'B0037718RC']}}\n", "{'asin': '0000037214', 'related': {'also_viewed': ['B00JO8II76', 'B00DGN4R1Q', 'B00E1YRI4C']}, 'title': 'Purple Sequin Tiny Dancer Tutu Ballet Dance Fairy Princess Costume Accessory', 'price': 6.99, 'salesRank': {'Clothing': 1233557}, 'imUrl': 'http://ecx.images-amazon.com/images/I/31mCncNuAZL.jpg', 'brand': 'Big Dreams', 'categories': [['Clothing, Shoes & Jewelry', 'Girls'], ['Clothing, Shoes & Jewelry', 'Novelty, Costumes & More', 'Costumes & Accessories', 'More Accessories', 'Kids & Baby']]}\n", "{'asin': '0000032069', 'title': 'Adult Ballet Tutu Cheetah Pink', 'price': 7.89, 'imUrl': 'http://ecx.images-amazon.com/images/I/51EzU6quNML._SX342_.jpg', 'related': {'also_bought': ['0000032050', 'B00D0DJAEG', '0000032042', 'B00D0F450I', 'B00D2JTMS2', 'B00D0FDUAY', 'B00D2JSRFQ', '0000032034', 'B00D0D5F6S', 'B00D2JRWWA', 'B00D0FIIJM', 'B00D0FCQQI', 'B00EXVN9PU', 'B0041EOTJO', 'B004PYEE8G', 'B001GTKPDQ', 'B00EON0SJ2', 'B005HMHOQ4', 'B002XZMGGQ'], 'also_viewed': ['B00D0F450I', '0000032050', 'B00D2JTMS2', '0000032042', 'B004PYEE8G', 'B00JHNSNSM', 'B00D0DJAEG', 'B00D2JSRFQ', 'B00D0FCQQI', 'B00D2JRWWA', 'B003AVNY6I', 'B0071KR2LC', 'B00GOR07RE', 'B00D0FIIJM', 'B005F50FXC', 'B0079MCIMU', 'B00D0FDUAY', 'B00H3RYN3I', 'B005C4Y4F6', 'B007IEFT84', 'B00D0D5F6S', 'B002BZX8Z6', 'B00JHONN1S', 'B008F0SU0Y', 'B00FNNFXAG', 'B007R2RM8W', 'B007VM3AMK', 'B00C0PLENA', 'B00BJGG6VG', 'B00E1YRI4C', 'B00IIK61WA', 'B009UC638W', 'B00KZN6RVI', 'B00CSFEENY', 'B002GZGI4E', 'B00HSOJJ94', 'B00LIPP4VG', 'B009RXWNSI', 'B00E87F196', 'B005HMHOQY', 'B00J6S9MSS', '0000032034', 'B00CJQGNJK', 'B008FCA0F0', 'B0056LG7GY', 'B00DPQWCZ2', 'B00I3PV0US', 'B00KZN6IVW', 'B0054TBWKO', 'B00I2S01I8', 'B00BXF12P8', 'B00GVHU678', 'B005NWENGC', 'B003AVKOP2', 'B00JK8MQ4Q', 'B00FZIMVQS', 'B008BB19VE', 'B00GTEXPOE', 'B009WPT2RQ', 'B00E37SBBG'], 'bought_together': ['0000032050', 'B00D0DJAEG', '0000032042', 'B00D0F450I']}, 'brand': 'BubuBibi', 'categories': [['Sports & Outdoors', 'Other Sports', 'Dance', 'Clothing', 'Girls', 'Skirts']]}\n", "{'asin': '0000031909', 'related': {'also_bought': ['B002BZX8Z6', 'B00JHONN1S', '0000031895', 'B00D2K1M3O', '0000031852', 'B00D0WDS9A', 'B00D10CLVW', 'B00D103F8U', 'B003AVEU6G', 'B00D2K0PA0', 'B002GZGI4E', 'B00D0ZF44Y', 'B008F0SMUC', 'B00D0GCI8S', 'B008F0SU0Y', 'B002YSCPZY', '0448408775', 'B002R0FABA', 'B008GHWNWC', 'B002R0FA24', 'B001GTKPEK', 'B006XA7KZO', 'B001GZUQ9S', 'B00613VNL0', 'B003IEDM9Q', 'B003LTOZK8', 'B003AVNY6I', 'B008UBQZKU', 'B001AQD8VQ', 'B003ILA0L2', 'B00AFDOPDA', 'B002R0F7FE'], 'also_viewed': ['B002BZX8Z6', 'B00JHONN1S', 'B008F0SU0Y', 'B00E1YRI4C', 'B00AFDOPDA', 'B002GZGI4E', 'B00CEUWY8K', 'B003IEDM9Q', 'B00HSOJB9M', '0000031895', 'B003AVKOP2', 'B005C4Y4F6', 'B008F0SMUC', 'B00362QGW0', 'B008UD01L2', 'B00FAZ5ZE6', 'B008F0SY6O', 'B00DPLLQR2', 'B00CEUWUZC', 'B004PYEE8G', 'B003AVNY6I', 'B00CEUX0D8', 'B00JHNSNSM', 'B00D10CLVW', 'B00D23MC6W', 'B007XAI53E', 'B008X6CBS2', 'B004PEI45U', 'B005HMHOQ4', 'B002C3Y6WG', 'B00HSC8O74', 'B008BMGHM4', 'B00CEUWTFS', 'B00FNNFXAG', 'B00CYBU84G', 'B00D9C32NI', 'B0046W9T8C', 'B008UBG5IW', 'B001YHX45G', 'B00CEV8366', 'B00I2UHSZA', 'B009RXWNSI', 'B008FCA0F0', 'B001GTKPEK', 'B004TU1VPU', 'B00CBPIO7S', 'B00CHHXJ0M', 'B00538F5OK', 'B005F50FXC', 'B00CEUX4QQ', 'B003XRKA7A', '0000031852', 'B002C3R5XI', 'B00C6Q1Z6E'], 'bought_together': ['B002BZX8Z6']}, 'title': 'Girls Ballet Tutu Neon Pink', 'price': 7.0, 'salesRank': {'Toys & Games': 201847}, 'imUrl': 'http://ecx.images-amazon.com/images/I/41xBoP0FVzL._SY300_.jpg', 'brand': 'Unknown', 'categories': [['Sports & Outdoors', 'Other Sports', 'Dance']], 'description': 'High quality 3 layer ballet tutu. 12 inches in length'}\n", "{'asin': '0000032034', 'title': 'Adult Ballet Tutu Yellow', 'price': 7.87, 'imUrl': 'http://ecx.images-amazon.com/images/I/21GNUNIa1CL.jpg', 'related': {'also_bought': ['B00D2JSRFQ', '0000032042', '0000032050', 'B00D2JTMS2', 'B00D0FDUAY', 'B00D0FIIJM', 'B00D2JRWWA', 'B00D0F450I', 'B00D0FCQQI', 'B00H3RYN3I', 'B002I55DT8', 'B00498HUQ6', 'B001YZCF1M', 'B00FNNFXAG', 'B00EON0SJ2', 'B000J09OV2', 'B0048WRX5G', 'B00I2EOG92', 'B003UM99FC', 'B00D0DJAEG', '0000032069', 'B00I2S01I8', 'B003AVKOP2', 'B003CPDAUW', 'B005HMHOQ4', 'B00JHONN1S', 'B00GOR07RE', 'B007TMMVJA', 'B00DPPRW2G', 'B0089ND408', 'B0046W9T8C', 'B005HMHOQE', 'B00EOOR812', 'B00CLZWXYI', 'B008AU29UQ', 'B00BNRKT6E', 'B004YHFSIO', 'B00EB5WN9Q', 'B008UBQZKU', 'B00D0D5F6S', 'B004PYEE8G', 'B00FQU9ZUA', 'B008AABRPO', 'B007BZ5CUA', 'B00I5SCG7E', 'B0036LOTNO', 'B009WPT2SA', 'B009QVCTTY', 'B00KZN5J8U', 'B008B81LN8', 'B00E1YRI4C', 'B004SVOVSE', 'B002I4ZJ1Q', 'B005AZMN3C', 'B00BBQFGWO', 'B009QVQZ3K', 'B005C4Y4F6', 'B008CLA6HG', 'B0085D9V1S', 'B000M55BDY', 'B00FE9DIHO', 'B009QVWIQ8', 'B00LIPP114', 'B001YHX45G', 'B00BN6S8WC', 'B009MDB6FE', 'B007AK1KTS', 'B00J6LZ16M', 'B002FRPE9I', 'B002RHLXKU', 'B006F404KQ', 'B00362OQQI', 'B003UNHJ4Y', 'B00D10CLVW', 'B002BZX8Z6', 'B0041EOTJO', 'B00F3KZUPC', 'B0055A1F4A', 'B0035BGVYU', 'B000P18LZ0', 'B007F2H4PU', 'B004XHVUE6', 'B00KF54D6W', 'B0097B1D8G', 'B00840TWES', 'B0050GAHKC', 'B00I9JSUO2', 'B003HCYEQY', 'B0075CNY7M', 'B00AFDOPDA', 'B008FCA0F0', 'B000IRG356', 'B00DSVBR7S', 'B00DYIQ8E2', 'B0041BVA80', 'B009M7FWBE'], 'also_viewed': ['B00D2JSRFQ', '0000032050', 'B00JHNSNSM', '0000032042', 'B00D2JTMS2', 'B003AVKOP2', 'B004YHFSIO', 'B00GOR07RE', 'B00D0FDUAY', 'B004PYEE8G', 'B00D0FCQQI', 'B009WPT2SA', 'B003AVNY6I', 'B00EON0SJ2', 'B005C4Y4F6', 'B00D0FIIJM', 'B00FNNFXAG', 'B00D0F450I', 'B00D2JRWWA', 'B008F0SU0Y', 'B00JHONN1S', 'B00FE9DIHO', 'B0071KR2LC', 'B00H3RYN3I', 'B00IIK61WA', 'B00D0DJAEG', 'B00KZN6RVI', 'B0054TBWKO', 'B00GEDG8D0', 'B00JMX4CCS', 'B00K18LJ6U', 'B0079MCIMU', 'B005HMHOQY', 'B009RXWNSI', 'B007XAI4LW', 'B007IEFTO8', 'B00E1YRI4C', 'B007R2RM8W', 'B002BZX8Z6', 'B00IIK61UW', 'B008F0SMUC', 'B00KF54D6W', 'B00E1Q66BG', 'B003WFSLBA', 'B00IJVASUE', 'B00DPPRW2G', 'B00HAVJ48G', 'B002C3Y6WG', 'B00I5RLL4Y', 'B003AVEU6G', 'B00E95LC8Q', 'B005F50FXC', 'B002U03H1M', 'B00E87F196', 'B008A7TFK6', 'B00HSOJB9M', 'B008A7TFGK', 'B00DPYOB2Q', '0375851682', 'B00CSFEENY'], 'bought_together': ['B00D2JSRFQ', '0000032050', '0000032042', 'B00D2JTMS2']}, 'brand': 'BubuBibi', 'categories': [['Sports & Outdoors', 'Other Sports', 'Dance', 'Clothing', 'Girls', 'Skirts']]}\n", '{\'asin\': \'0000589012\', \'title\': "Why Don\'t They Just Quit? DVD Roundtable Discussion: What Families and Friends need to Know About Addiction and Recovery", \'price\': 15.95, \'imUrl\': \'http://ecx.images-amazon.com/images/I/519%2B1kseM3L._SY300_.jpg\', \'related\': {\'also_bought\': [\'B000Z3N1HQ\', \'0578045427\', \'B007VI5AQ8\', \'B003AC98V2\', \'B004V4RW8O\', \'B000I0QL7I\', \'B000J10F8C\', \'B0007CEXYK\', \'B000ERVK4Y\', \'B000XSKDBA\', \'B002UNMWTC\', \'B00008MTXI\', \'B007TSV4GK\', \'B0052ADP6Y\', \'B00EUENWIY\', \'B003YKYX9M\', \'B004RD3YFE\', \'B007Y9F6RW\', \'B00004UEDQ\', \'B0039Y774Q\', \'B0006IIKRG\', \'B00JAGF9HE\', \'6305162026\', \'6305692572\', \'B001D7T460\', \'B0018QOIWG\', \'B002Y7ZELW\', \'B0045HCJ08\', \'0830907394\', \'B000LAZDPG\', \'B00A2H9QN8\', \'B001O5CLXY\', \'B000JBXXYK\', \'B003B3NGS6\', \'B0037SR3N4\', \'B00641Y2ZS\', \'0470903953\', \'0977977315\', \'B00049QQHI\', \'B000E6ESU8\', \'0470402741\', \'061565732X\', \'0615763146\', \'B000VZPTH8\', \'B003JO6OPO\', \'B00787BTEO\', \'B004R1Q7YQ\', \'B001GG6GKK\', \'B0015VQAZM\', \'1592854869\', \'B000QRIL08\', \'B000GQLA8O\', \'B000MPM3TE\', \'0979021804\', \'1608823407\', \'159285821X\', \'B00005Q4CS\', \'B0000549B1\', \'6305594333\', \'B00AFEXRME\', \'B004FN25AG\', \'0830906363\', \'0470402768\', \'1118414756\', \'B009SV4O2M\', \'1481106694\', \'1572306254\', \'B0013MOLPO\', \'B00009Y3QI\', \'B003NMOL2U\', \'B001AKBI8C\', \'0981708803\', \'1572306394\', \'B00B9LNPA6\', \'B005BYBZEK\', \'B004D7SBMU\', \'B00CQMADIO\', \'0470405511\', \'B00CHEHHT4\', \'B000ESUWY2\', \'0792838068\', \'B00AWE09Z0\', \'B00E4XZZEK\', \'0830914870\', \'B00GFZLEF4\', \'083090459X\', \'1402218443\', \'1893007170\', \'1893277046\', \'B005CKI7H6\', \'B0001LQL6K\', \'B000067S10\', \'0890425558\', \'B00114KYC8\', \'1466221224\', \'0943158508\', \'B00A7ID5BG\', \'0671765582\', \'B000B8IH10\', \'1568381395\'], \'buy_after_viewing\': [\'B003AC98V2\', \'B007VI5AQ8\', \'B000ERVK4Y\', \'B0007CEXYK\']}, \'salesRank\': {\'Movies & TV\': 1084845}, \'categories\': [[\'Movies & TV\', \'Movies\']]}\n', "{'asin': '0001048775', 'description': 'William Shakespeare is widely regarded as the greatest playwright the world has seen. He produced an astonishing amount of work; 37 plays, 154 sonnets, and 5 poems. He died on 23rd April 1616, aged 52, and was buried in the Holy Trinity Church, Stratford.', 'title': 'Measure for Measure: Complete &amp; Unabridged', 'imUrl': 'http://ecx.images-amazon.com/images/I/5166EBHDQYL.jpg', 'salesRank': {'Books': 13243226}, 'categories': [['Books']]}\n", "{'asin': '0000031852', 'related': {'also_bought': ['B00JHONN1S', 'B002BZX8Z6', 'B00D2K1M3O', '0000031909', 'B00613WDTQ', 'B00D0WDS9A', 'B00D0GCI8S', '0000031895', 'B003AVKOP2', 'B003AVEU6G', 'B003IEDM9Q', 'B002R0FA24', 'B00D23MC6W', 'B00D2K0PA0', 'B00538F5OK', 'B00CEV86I6', 'B002R0FABA', 'B00D10CLVW', 'B003AVNY6I', 'B002GZGI4E', 'B001T9NUFS', 'B002R0F7FE', 'B00E1YRI4C', 'B008UBQZKU', 'B00D103F8U', 'B007R2RM8W'], 'also_viewed': ['B002BZX8Z6', 'B00JHONN1S', 'B008F0SU0Y', 'B00D23MC6W', 'B00AFDOPDA', 'B00E1YRI4C', 'B002GZGI4E', 'B003AVKOP2', 'B00D9C1WBM', 'B00CEV8366', 'B00CEUX0D8', 'B0079ME3KU', 'B00CEUWY8K', 'B004FOEEHC', '0000031895', 'B00BC4GY9Y', 'B003XRKA7A', 'B00K18LKX2', 'B00EM7KAG6', 'B00AMQ17JA', 'B00D9C32NI', 'B002C3Y6WG', 'B00JLL4L5Y', 'B003AVNY6I', 'B008UBQZKU', 'B00D0WDS9A', 'B00613WDTQ', 'B00538F5OK', 'B005C4Y4F6', 'B004LHZ1NY', 'B00CPHX76U', 'B00CEUWUZC', 'B00IJVASUE', 'B00GOR07RE', 'B00J2GTM0W', 'B00JHNSNSM', 'B003IEDM9Q', 'B00CYBU84G', 'B008VV8NSQ', 'B00CYBULSO', 'B00I2UHSZA', 'B005F50FXC', 'B007LCQI3S', 'B00DP68AVW', 'B009RXWNSI', 'B003AVEU6G', 'B00HSOJB9M', 'B00EHAGZNA', 'B0046W9T8C', 'B00E79VW6Q', 'B00D10CLVW', 'B00B0AVO54', 'B00E95LC8Q', 'B00GOR92SO', 'B007ZN5Y56', 'B00AL2569W', 'B00B608000', 'B008F0SMUC', 'B00BFXLZ8M'], 'bought_together': ['B002BZX8Z6']}, 'title': 'Girls Ballet Tutu Zebra Hot Pink', 'price': 3.17, 'salesRank': {'Toys & Games': 211836}, 'imUrl': 'http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg', 'brand': 'Coxlures', 'categories': [['Sports & Outdoors', 'Other Sports', 'Dance']], 'description': 'TUtu'}\n", '{\'asin\': \'0001048236\', \'categories\': [[\'Books\']], \'description\': "&#34;One thing is certain, Sherlockians, put aside your Baring-GouldAnnotated, your Folio SocietyIllustrated-for the time being, the Oxford is the edition to curl up with on a winter\'s night&#34;--The Chicago Tribune&#34;An incomparable gift book; or, should you find it impossible to surrender up such treasures, the best of gifts to oneself&#34;--USA Today&#34;To the true Sherlockian, this will be a treasure; to otherwise diverted detective story fans, it is a rich lode for discovery&#34;--Denver Post&#34;The complete and authentic adventures of the legendary detective--expertly edited and annotated by a team of Holmes scholars....in a handsome, boxed set....A lovely gift&#34;--The Christian Science Monitor&#34;Here in nine volumes...are all the adventures of Holmes and Watson. Each book has an introduction, something new and fascinating for even the most devoted Holmesians plus a series of intelligent notes at the back of each volume.&#34;--Oxford Times&#34;TheOxford Sherlock Holmes, a new edition of the stories, is a splendid piece of publishing. Nine compact volumes, beautifully produced, each with a stimulating introduction; clear type, accurate texts, a handy chronology, a helpful bibliography. And, most valuable of all, explanatory notes running to 50 pages or more per volume.&#34; --John Gross, writing inSunday Telegraph--This text refers to thePaperbackedition.", \'title\': \'The Sherlock Holmes Audio Collection\', \'price\': 9.26, \'salesRank\': {\'Books\': 8973864}, \'imUrl\': \'http://ecx.images-amazon.com/images/I/51DH145C5JL.jpg\', \'related\': {\'also_viewed\': [\'1442300191\', \'9626349786\', \'1602837155\', \'1598879162\', \'1400115159\', \'1478396202\', \'1408426250\', \'B007PM2A4A\', \'1609980603\'], \'buy_after_viewing\': [\'0312089457\']}}\n']  from ast import literal_eval  new_data = [literal_eval(x) for x in data]  import pandas  pandas.DataFrame(new_data) 
Read More

Monday, November 20, 2017

Counting documents by property occurrence in Kibana

Leave a Comment

I'm trying to create a visualization that looks like this:

  • Foobar, 10
  • Bar, 8
  • Baz, 5.6

The first column is the aggregation itself. Imagine i have documents like this:

{   id: 1,   name: 'lorem ipsum',   type: 'A'   author: {     name: 'Foobar',   } } {   id: 2,   name: 'dolor sit amet',   type: 'B',   author: {     name: 'Foobar',   } } 

So, i want to add a +1 to the score of "Foobar" everytime i find a document of type A. And a +2 to the score if i find a document of type B. Basically, aggregating by the author name, and calculating a dynamic value on results.

Is this possible in Kibana? Thanks for the help.

1 Answers

Answers 1

AFAIK, you can't do this in Kibana in visualize panel, maybe you can try it in program then index the result into es.

Read More

Thursday, July 6, 2017

How to use biglm with more than 2^31 observations

Leave a Comment

I am working with a large set of data that contains more than 2^31 observations. The actual number of observations is close to 3.5 billion observations.

I am using the R package "biglm" to run a regression with approximately 70 predictors. I read in the data one million rows at a time and update the regression results. The data have been saved in the ffdf format using the R library "ffdf" to load quickly and avoid using up all my RAM.

Here is the basic outline of the code I am using:

library(ff,ffbase,biglm) load.ffdf(dir='home')  dim(data) #the ffdf contains about 70 predictors and 3.5 billion rows  chunk_1 <- data[1:1000000,] rest_of_data <- data[1000000:nrow(data),]  # Running biglm for first chunk b <- biglm(y~x1+x2+...+x70, chunk_1)  chunks <- ceiling((nrow(rest_of_data)/1000000)  # Updating biglm results by iterating through the rest of the data chunks for (i in seq(1,chunks)){       start <- 1+((i-1))*1000000       end <- min(i*1000000,nrow(d))       d_chunk <- d[start:end,]       b<-update(b,d_chunk) } 

The results look great and everything is running smoothly until the cumulative number of observations from updating the model with each chunk of the data exceeds 2^31 observations. Then, I get an error that reads

In object$n + NROW(mm) : NAs produced by integer overflow 

How do I get around this overflow issue? Thanks in advance for your help!

1 Answers

Answers 1

I believe that I have found the source of the issue in the biglm code.

The number of observations (n) is stored as an integer, which has a max value of 2^31 - 1.

The numeric type is not subject to this limit, and, as far as I can tell, can be used instead of integers to store n.

Here is a commit on github showing how to fix this problem with one additional line of code that converts the integer n to a numeric. As the model is updated, the number of rows in the new batch is added to the old n, so the type of n remains numeric.

I was able to reproduce the error described in this question and verify that my fix works with this code:

(WARNING: This consumes a large amount of memory, consider doing more iterations with a smaller array if you have tight memory constraints)

library(biglm) df = as.data.frame(replicate(3, rnorm(10000000))) a = biglm(V1 ~ V2 + V3, df) for (i in 1:300) {     a = update(a, df) } print(summary(a)) 

In the original biglm library, this code outputs:

Large data regression model: biglm(ff, df) Sample size =  NA                Coef (95% CI) SE  p (Intercept) -1e-04   NA  NA NA NA V2          -1e-04   NA  NA NA NA V3          -2e-04   NA  NA NA NA 

My patched version outputs:

Large data regression model: biglm(V1 ~ V2 + V3, df) Sample size =  3.01e+09                Coef   (95%    CI) SE p (Intercept) -3e-04 -3e-04 -3e-04  0 0 V2          -2e-04 -2e-04 -1e-04  0 0 V3           3e-04  3e-04  3e-04  0 0 

The SE and p values are non-zero, just rounded in the output above.

I am fairly new to the R ecosystem, so I would appreciate it if someone could tell me how to submit this patch so that it can be reviewed by the original author and eventually included in the upstream package.

Read More

Tuesday, June 20, 2017

Moving from Relational Database to Big Data

Leave a Comment

Currently I have an application hosted on the Google Cloud Platform that offers web analytics and provides session activity (clicks, downloads etc) and ties that web activity with web registrations.

At the moment we store all of our click and session profile data in MySQL and use SQL queries to generate both aggregate and per-user reports, however, as the amount of data has grown, we are seeing a real slow-down in query responses which is in turn slowing down page-load times.

In investigating ways we can solve this problem, we have looked into tools available on Google Cloud Platform like Dataproc and Dataflow as well as NoSQL solutions, however, I am having a hard time understanding how we could apply our current solution to any of these solutions.

Currently, a rough idea of our data schema is as follows:

User table - id - name - email  Profile table (web browser/device) - id - user id - user agent string  Session table - id - profile id - session string  Action table - id - session id - action type - action details - timestamp 

Based on my research, my understanding of what would be the best solution would be to store action data in a NoSQL database solution like BigTable which feeds data into a solution like DataProc or DataFlow which generates the reports. However, given that our current schema is a highly relational structure, seems to remove the option of moving towards a NoSQL solution as all my research indicates that you shouldn't move relational data to a NoSQL solution.

My question is, is my understanding of how to apply these tools correct? Or are there better solutions? Is it even necessary to consider moving away from MySQL? And if not, what kind of solutions are available that would allow us to possibly pre-process/generate reporting data in the background?

2 Answers

Answers 1

Assuming that sessions and actions table values are not updated and only insert. The best way would be to separate the databases into two parts. Keep the MySQL DB for user and profile tables and use the BigQuery for actions and sessions.

This way you have following:

  • minimize the amount of change you have to do on the either sides (data ingestion and extraction)
  • you will significantly reduce the cost of data storage
  • query times will significantly improve
  • before you know it, you will be in the big data territory and BigQuery is just the solution for it

BigQuery is the best way. But, if you have too many extra resources and time available, you can look into storing it into NoSQL db, then run a pipeline job on it using DataFlow to extract analytics data which you will again need to store in a database for querying purposes.

Answers 2

A couple of questions / potential solutions:

  1. Profile! If it's the same queries thrashing the database, then optimising your queries or caching some of the results for your most frequent pages can help offload processing. Ditto for database settings, RAM, etc.
  2. How big is your database? If it's less than 64GB, scaling up to a larger server where the database can fit into RAM could be a quick win.
  3. How is your data being used? If it's purely for historical data, you could potentially reduce your clicks down into a lookup table, eg. actions per session per week or per user per week. If the data is collated per 5 minutes / hour, downloading the raw data and processing it like this locally can work too.
  4. You can denormalise, eg. combine user agent|session|action type|details|timestamp into one row, but you potentially increase your storage requirements and lookup time.
  5. Alternatively, more normalisation can help too. Breaking out the user agent string into its own table will reduce that table's data requirements and might speed things up.
  6. It seems like your data might be able to be split up / sharded by user, so that could be another option.

In general, the fastest way to work these questions out is to give it a try for your specific workloads, eg. how many of your typical requests (or random dashboards) can you do on a development machine with a reasonable amount of RAM (or spin up a server/create a different test database).

Also, if you're mostly used to relational databases, there'll be some overhead in switching (particularly for bleeding edge solutions), so you need to be fairly sure that the costs outweigh the benefits before you switch, or switch a little bit at a time so that you can switch back if it doesn't work out. Again, testing helps.

Read More

Saturday, February 25, 2017

How to know which stage of a job is currently running in Apache Spark?

Leave a Comment

Consider I have a job as follow in Spark;

CSV File ==> Filter By A Column ==> Taking Sample ==> Save As JSON

Now my requirement is how do I know which step(Fetching file or Filtering or Sampling) of the job is currently executing programatically (Preferably using Java API)? Is there any way for this?

I can track Job,Stage and Task using SparkListener class. And it can be done like tracking a stage Id. But how to know which stage Id is for which step in the job chain.

What I want to send a notification to user when consider Filter By A Column is completed. For that I made a class that extends SparkListener class. But I can not find out from where I can get the name of currently executing transformation name. Is it possible to track at all?

public class ProgressListener extends SparkListener{    @Override   public void onJobStart(SparkListenerJobStart jobStart)   {    }    @Override   public void onStageSubmitted(SparkListenerStageSubmitted stageSubmitted)   {       //System.out.println("Stage Name : "+stageSubmitted.stageInfo().getStatusString()); giving action name only   }    @Override   public void onTaskStart(SparkListenerTaskStart taskStart)   {       //no such method like taskStart.name()   } } 

2 Answers

Answers 1

You cannot exactly know when, e.g., the filter operation starts or finishes.

That's because you have transformations (filter,map,...) and actions (count, foreach,...). Spark will put as many operations into one stage as possible. Then the stage is executed in parallel on the different partitions of your input. And here comes the problem.

Assume you have several workers and the following program

LOAD ==> MAP ==> FILTER ==> GROUP BY + Aggregation

This program will probably have two stages: the first stage will load the file and apply the map and filter. Then the output will be shuffled to create the groups. In the second stage the aggregation will be performed.

Now, the problem is, that you have several workers and each will process a portion of your input data in parallel. That is, every executor in your cluster will receive a copy of your program(the current stage) and execute this on the assigned partition.

You see, you will have multiple instances of your map and filter operators that are executed in parallel, but not necessarily at the same time. In an extreme case, worker 1 will finish with stage 1 before worker 20 has started at all (and therefore finish with its filter operation before worker 20).

For RDDs Spark uses the iterator model inside a stage. For Datasets in the latest Spark version however, they create a single loop over the partition and execute the transformations. This means that in this case Spark itself does not really know when a transformation operator finished for a single task!

Long story short:

  1. You are not able the know when an operation inside a stage finishes
  2. Even if you could, there are multiple instances that will finish at different times.

So, now I already had the same problem:

In our Piglet project (please allow some adverstisement ;-) ) we generate Spark code from Pig Latin scripts and wanted to profile the scripts. I ended up in inserting mapPartition operator between all user operators that will send the partition ID and the current time to a server which will evaluate the messages. However, this solution also has its limitations... and I'm not completely satisfied yet.

However, unless you are able to modify the programs I'm afraid you cannot achieve want you want.

Answers 2

Did you consider this option: http://spark.apache.org/docs/latest/monitoring.html
It seems you can use the following rest api to get a certain job state /applications/[app-id]/jobs/[job-id]

You can set the JobGroupId and JobGroupDescription so you can track what job group is being handled. i.e. setJobGroup

Assuming you'll call the JobGroupId "test"

sc.setJobGroup("1", "Test job") 

When you'll call the http://localhost:4040/api/v1/applications/[app-id]/jobs/[job-id]

You'll get a json with a descriptive name for that job:

{   "jobId" : 3,   "name" : "count at <console>:25",   "description" : "Test Job",   "submissionTime" : "2017-02-22T05:52:03.145GMT",   "completionTime" : "2017-02-22T05:52:13.429GMT",   "stageIds" : [ 3 ],   "jobGroup" : "1",   "status" : "SUCCEEDED",   "numTasks" : 4,   "numActiveTasks" : 0,   "numCompletedTasks" : 4,   "numSkippedTasks" : 0,   "numFailedTasks" : 0,   "numActiveStages" : 0,   "numCompletedStages" : 1,   "numSkippedStages" : 0,   "numFailedStages" : 0 } 
Read More

Tuesday, May 3, 2016

Spark program gives odd results when ran on standalone cluster

Leave a Comment

I have this spark program and I'll try to limit it to just the pertinent parts

# Split by delimiter , # If the file is in unicode, we need to convert each value to a float in order to be able to  # treat it as a number points = sc.textFile(filename).map(lambda line: [float(x) for x in line.split(",")]).persist()  # start with K randomly selected points from the dataset # A centroid cannot be an actual data point or else the distance measure between a point and  # that centroid will be zero. This leads to an undefined membership value into that centroid. centroids = points.takeSample(False, K, 34) #print centroids # Initialize our new centroids newCentroids = [[] for k in range(K)] tempCentroids = [] for centroid in centroids:     tempCentroids.append([centroid[N] + 0.5]) #centroids = sc.broadcast(tempCentroids)  convergence = False  ncm = NCM()  while(not convergence):     memberships = points.map(lambda p : (p, getMemberships([p[N]], centroids.value, m)))     cmax = memberships.map(lambda (p, mus) : (p, getCMax(mus, centroids.value)))     # Memberships     T = cmax.map(lambda (p, c) : (p, getMemberships2([p[N]], centroids.value, m, delta, weight1, weight2, weight3, c)))     I = cmax.map(lambda (p, c) : (p, getIndeterminateMemberships([p[N]], centroids.value, m, delta, weight1, weight2,  weight3, c)[0]))     F = cmax.map(lambda (p, c) : (p, getFalseMemberships([p[N]], centroids.value, m, delta, weight1,  weight2, weight3, c)[0]))     # Components of new centroids     wTm = T.map(lambda (x, t) : ('onekey', scalarPow(m, scalarMult(weight1, t))))     #print "wTm = " + str(wTm.collect())     print "at first reduce"     sumwTm = wTm.reduceByKey(lambda p1, p2 : addPoints(p1, p2))     #print "sumwTm = " + str(sumwTm.collect())     wTmx = T.map(lambda (x, t) : pointMult([x[N]], scalarPow(m, scalarMult(weight1, t))))     print "adding to cnumerator list"     #print wTmx.collect()     cnumerator = wTmx.flatMap(lambda p: getListComponents(p)).reduceByKey(lambda p1, p2 : p1 + p2).values()     print "collected cnumerator, now printing"         #print "cnumerator = " + str(cnumerator.collect())     #print str(sumwTm.collect())     # Calculate the new centroids     sumwTmCollection = sumwTm.collect()[0][1]     cnumeratorCollection = cnumerator.collect()     #print "sumwTmCollection = " + str(sumwTmCollection)     #cnumeratorCollection =cnumerator.collectAsMap().get(0).items     print "cnumeratorCollection = " + str(cnumeratorCollection)     for i in range(len(newCentroids)):         newCentroids[i] = scalarMult(1 / sumwTmCollection[i], [cnumeratorCollection[i]])     centroids = newCentroids     # Test for convergence     convergence = ncm.test([centroids[N]], [newCentroids[N]], epsilon)      #convergence = True      # Replace our old centroids with the newly found centroids and repeat if convergence not met     # Clear out space for a new set of centroids     newCentroids = [[] for k in range(K)] 

This program works pretty well on my local machine, however, it does not behave as expected when run on a standalone cluster. It doesn't necessarily throw an error, but what it does do it give different output than that which I receive when running on my local machine. The cluster and the 3 nodes seem to be working fine. I have a feeling the problem is that I keep updating centroids, which is a python list, and it changes each time through the while-loop. Is it possible that each node may not have the most recent copy of that list? I think so so I tried using a broadcast variable but those can't be updated (read only). I also tried using an accumulator but those are just for accumulations. I also tried to save the python lists as a file on hdfs for each node to have access to, but this didn't work well. Do you think I'm understanding the problem correctly? Is something else likely going on here? How can I get code that works fine on my local machine, but not on a cluster?

1 Answers

Answers 1

Thank you for all of the time and attention to this problem, especially since it sounds like I could have posted more information to make your jobs easier. The problem here is in

centroids = points.takeSample(False, K, 34) 

I didn't realize this, but after a short experiment, this function returns the same output each and every time, despite being what I thought was a random sample. As long as you use the same seed (34 in this case), you will get the same RDD in return. The RDD on my cluster was different for some reason than the one returned to my local machine. In any case, since it was the same RDD each time, my output never changed. The problem with the "random" centroids returned to me is that these particular ones gave rise to something like a saddle point in mathematics, where no convergence of the centroids would be found. This part of the answer is mathematical and a programming one, so I won't mention it further. My real hope at this point is that others are helped by the notion that if you want

centroids = points.takeSample(False, K, 34) 

to produce different samples each time it is called, that you change your seed each time to some random number.

I hope this all helps. I've never before spent so much time on a solution to my memory.

Thanks again.

Read More

Friday, April 1, 2016

Sorting a data stream before writing to file in nodejs

Leave a Comment

I have an input file which may potentially contain upto 1M records and each record would look like this

field 1 field 2 field3 \n

I want to read this input file and sort it based on field3 before writing it to another file.

here is what I have so far

var fs = require('fs'),     readline = require('readline'),     stream = require('stream');  var start = Date.now();  var outstream = new stream; outstream.readable = true; outstream.writable = true;  var rl = readline.createInterface({     input: fs.createReadStream('cross.txt'),     output: outstream,     terminal: false });  rl.on('line', function(line) {     //var tmp = line.split("\t").reverse().join('\t') + '\n';     //fs.appendFileSync("op_rev.txt", tmp );     // this logic to reverse and then sort is too slow });  rl.on('close', function() {     var closetime = Date.now();     console.log('Read entirefile. ', (closetime - start)/1000, ' secs'); }); 

I am basically stuck at this point, all I have is the ability to read from one file and write to another, is there a way to efficiently sort this data before writing it

4 Answers

Answers 1

DB and sort-stream are fine solutions, but DB might be an overkill and I think sort-stream eventually just sorts the entire file in an in-memory array (on through end callback), so I think performance will be roughly the same, comparing to the original solution.
(but I haven't ran any benchmarks, so I might be wrong).

So, just for the hack of it, I'll throw in another solution :)


EDIT: I was curious to see how big a difference this will be, so I ran some benchmarks.

Results were surprising even to me, turns out sort -k3,3 solution is better by far, x10 times faster then the original solution (a simple array sort), while nedb and sort-stream solutions are at least x18 times slower than the original solution (i.e. at least x180 times slower than sort -k3,3).

(See benchmark results below)


If on a *nix machine (Unix, Linux, Mac, ...) you can simply use
sort -k 3,3 yourInputFile > op_rev.txt and let the OS do the sorting for you.
You'll probably get better performance, since sorting is done natively.

Or, if you want to process the sorted output in Node:

var util = require('util'),     spawn = require('child_process').spawn,     sort = spawn('sort', ['-k3,3', './test.tsv']);  sort.stdout.on('data', function (data) {     // process data     data.toString()         .split('\n')         .map(line => line.split("\t"))         .forEach(record => console.info(`Record: ${record}`)); });  sort.on('exit', function (code) {     if (code) {         // handle error     }      console.log('Done'); });  // optional sort.stderr.on('data', function (data) {     // handle error...     console.log('stderr: ' + data); }); 

Hope this helps :)


EDIT: Adding some benchmark details.

I was curious to see how big a difference this will be, so I ran some benchmarks.

Here are the results (running on a MacBook Pro):

  • sort1 uses a straightforward approach, sorting the records in an in-memory array.
    Avg time: 35.6s (baseline)

  • sort2 uses sort-stream, as suggested by Joe Krill.
    Avg time: 11.1m (about x18.7 times slower)
    (I wonder why. I didn't dig in.)

  • sort3 uses nedb, as suggested by Tamas Hegedus.
    Time: about 16m (about x27 times slower)

  • sort4 only sorts by executing sort -k 3,3 input.txt > out4.txt in a terminal
    Avg time: 1.2s (about x30 times faster)

  • sort5 uses sort -k3,3, and process the response sent to stdout
    Avg time: 3.65s (about x9.7 times faster)

Answers 2

You can take advantage of streams for something like this. There's a few NPM modules that will be helpful -- first include them by running

npm install sort-stream csv-parse stream-transform 

from the command line.

Then:

var fs = require('fs'); var sort = require('sort-stream'); var parse = require('csv-parse'); var transform = require('stream-transform');  // Create a readble stream from the input file. fs.createReadStream('./cross.txt')   // Use `csv-parse` to parse the input using a tab character (\t) as the    // delimiter. This produces a record for each row which is an array of    // field values.   .pipe(parse({     delimiter: '\t'   }))   // Use `sort-stream` to sort the parsed records on the third field.    .pipe(sort(function (a, b) {     return a[2].localeCompare(b[2]);   }))   // Use `stream-transform` to transform each record (an array of fields) into    // a single tab-delimited string to be output to our destination text file.   .pipe(transform(function(row) {     return row.join('\t') + '\r';   }))   // And finally, output those strings to our destination file.   .pipe(fs.createWriteStream('./cross_sorted.txt')); 

Answers 3

You have two options, depending on how much data is being processed. (1M record count with 3 columns doesn't say much about the amount of actual data)

Load the data in memory, sort in place

var lines = []; rl.on('line', function(line) {     lines.push(line.split("\t").reverse()); });  rl.on('close', function() {     lines.sort(function(a, b) { return compare(a[0], b[0]); });      // write however you want     fs.writeFileSync(         fileName,         lines.map(function(x) { return x.join("\t"); }).join("\n")     );     function compare(a, b) {         if (a < b) return -1;         if (a > b) return 1;         return 0;     } }); 

Load the data in a persistent database, read ordered

Using a database engine of your choice (for example nedb, a pure javascript db for nodejs)

EDIT: It seems that NeDB keeps the whole database in memory, the file is only a persistent copy of the data. We'll have to search for another implementation. TingoDB looks promising.

// This code is only to give an idea, not tested in any way  var Datastore = require('nedb'); var db = new Datastore({     filename: 'path/to/temp/datafile',     autoload: true });  rl.on('line', function(line) {     var tmp = line.split("\t").reverse();     db.insert({         field0: tmp[0],         field1: tmp[1],         field2: tmp[2]     }); });  rl.on('close', function() {     var cursor = db.find({})             .sort({ field0: 1 }); // sort by field0, ascending     var PAGE_SIZE = 1000;     paginate(0);     function paginate(i) {         cursor.skip(i).take(PAGE_SIZE).exec(function(err, docs) {             // handle errors              var tmp = docs.map(function(o) {                 return o.field0 + "\t" + o.field1 + "\t" + o.field2 + "\n";             });             fs.appendFileSync("op_rev.txt", tmp.join(""));             if (docs.length >= PAGE_SIZE) {                 paginate(i + PAGE_SIZE);             } else {                 // cleanup temp database             }         });     } }); 

Answers 4

i had quite similar issue, needed to perform an external sort.

I figured out, after waste a few time on it that i could load up the data on a database and then query out the desired data from it.

It not even matter if the inserts aren't ordered, as long as my query result could be.

Hope it can work for you too.

In order to insert your data on a database, there are plenty of tools on node to perform such task. I have this pet project which does a similar job.

I'm also sure that if you search the subject, you'll find much more info.

Good luck.

Read More

Thursday, March 17, 2016

Cloudera Hadoop - Daemons not running

Leave a Comment

I'm self learning Hadoop and started of with installing Cloudera QuickVM on a VMware Workstation running CENT OS.

I was under the impression that Quickstart VM has most the of configurations predefined. Do I need to set up any other configurations to set up data and name node? Reason being when I type JFS I get only

[cloudera@quickstart bin]$ jps </n>  9480 RunJar </n>  18607 Jps 6952 </n>  org.eclipse.equinox.launcher_1.3.0.v20140415-2008.jar 

I don't see any data nodes nor name nodes.I looked it up online and solution was to start the daemons using start-all.sh. But I get an error

[cloudera@quickstart hadoop]$ start-all.sh  bash: start-all.sh: command not found 

Another round of research landed me in this

As suggested in that answer, when I ran bin/hadoop namenode -format

[cloudera@quickstart hadoop]$ bin/hadoop namenode -format  bash: bin/hadoop: No such file or directory 

I'm stuck in the middle of nowhere with motivation for self study getting really low. Please help with a suggestion and raise up my spirits.

P.S.: I'm an extreme beginner on Hadoop, so please don't be harsh on me if the question looks stupid to you :)

3 Answers

Answers 1

Are you sure you are in the correct folder? Try using ls to see the folder contents.

[cloudera@quickstart hadoop]$ start-all.sh bash: start-all.sh: command not found  [cloudera@quickstart hadoop]$ bin/hadoop namenode -format bash: bin/hadoop: No such file or directory 

This pretty much says that you are in the wrong folder. No such file or directory.

Make sure to learn the basics of using the bash shell, too. Such as ls -l start-all.sh to see if the file exists that you are trying to run; and to see the file permissions.

Answers 2

Finally, I found out how to start services on cloudera quickstart vm with some help from the community.

service hadoop-hdfs-namenode start 

Now when i run JPS, I can see all the daemons running,

[root@quickstart cloudera]# jps 2374 JobHistoryServer 2070 NameNode 3294 RunJar 4445 Bootstrap 4803  2947 -- process information unavailable 2196 SecondaryNameNode 1840 QuorumPeerMain 1908 DataNode 4836  3094 RunJar 3777 Master 2865 RESTServer 2594 ResourceManager 2327 Bootstrap 3663 Bootstrap 2451 NodeManager 1999 JournalNode 3111 Jps 3684 HistoryServer 4784 Bootstrap 

Thanks a lot for your attention.

Answers 3

Please run below mentioned command. It will give you list of locations.

find / -type f -name start-all.sh 

cd /path_name then try running command,

./start-all.sh 
Read More