Monday, October 2, 2017

How to open a huge Excel file efficiently


I have a 150MB one-sheet excel file that takes about 7 minutes to open on a very powerful machine using the following:

# using python
import xlrd
wb = xlrd.open_workbook(file)
sh = wb.sheet_by_index(0)

Is there any way to open the excel file quicker? I'm open to even very outlandish suggestions (such as hadoop, spark, c, java, etc.). Ideally I'm looking for a way to open the file in under 30 seconds if that's not a pipe dream. Also, the above example is using python, but it doesn't have to be python.


Note: this is an Excel file from a client. It cannot be converted into any other format before we receive it. It is not our file.


UPDATE: An answer with a working example of code that opens the following 200MB Excel file in under 30 seconds will be rewarded with a bounty: https://drive.google.com/file/d/0B_CXvCTOo7_2VW9id2VXRWZrbzQ/view?usp=sharing. The file has string (col 1), date (col 9), and number (col 11) columns.

12 Answers

Answer 1

Most programming languages that work with Office products go through some middle layer, and this is usually where the bottleneck is; good examples are the PIAs/Interop and the Open XML SDK.

One way to get the data at a lower level (bypassing the middle layer) is using a Driver.

150MB one-sheet Excel file that takes about 7 minutes.

The best I could do is a 130MB file in 135 seconds, roughly 3 times faster:

Stopwatch sw = new Stopwatch();
sw.Start();

DataSet excelDataSet = new DataSet();

string filePath = @"c:\temp\BigBook.xlsx";

// For .XLSX we use Provider=Microsoft.ACE.OLEDB.12.0;
// for .XLS we'd use Microsoft.Jet.OLEDB.4.0 with Extended Properties=\"Excel 8.0;HDR=YES;\"
string connectionString = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source='" + filePath + "';Extended Properties=\"Excel 12.0;HDR=YES;\"";

using (OleDbConnection conn = new OleDbConnection(connectionString))
{
    conn.Open();
    OleDbDataAdapter objDA = new System.Data.OleDb.OleDbDataAdapter("select * from [Sheet1$]", conn);
    objDA.Fill(excelDataSet);
    //dataGridView1.DataSource = excelDataSet.Tables[0];
}
sw.Stop();
Debug.Print("Load XLSX took: " + sw.ElapsedMilliseconds + " millisecs. Records = " + excelDataSet.Tables[0].Rows.Count);


Win 7 x64, Intel i5 2.3GHz, 8GB RAM, 250GB SSD.

If I could recommend a hardware solution as well: try an SSD if you're using a standard HDD.

Note: I can't download your Excel spreadsheet example as I'm behind a corporate firewall.

PS. See MSDN - Fastest Way to import xlsx files with 200 MB of Data, the consensus being that OleDB is the fastest.

PS 2. Here's how you can do it with python: http://code.activestate.com/recipes/440661-read-tabular-data-from-excel-spreadsheets-the-fast/

Answer 2

I managed to read the file in about 30 seconds using .NET core and the Open XML SDK.

The following example returns a list of objects containing all rows and cells with the matching types, it supports date, numeric and text cells. The project is available here: https://github.com/xferaa/BigSpreadSheetExample/ (Should work on Windows, Linux and Mac OS and does not require Excel or any Excel component to be installed).

public List<List<object>> ParseSpreadSheet()
{
    List<List<object>> rows = new List<List<object>>();

    using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(filePath, false))
    {
        WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
        WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();

        OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);

        Dictionary<int, string> sharedStringCache = new Dictionary<int, string>();

        int i = 0;
        foreach (var el in workbookPart.SharedStringTablePart.SharedStringTable.ChildElements)
        {
            sharedStringCache.Add(i++, el.InnerText);
        }

        while (reader.Read())
        {
            if (reader.ElementType == typeof(Row))
            {
                reader.ReadFirstChild();

                List<object> cells = new List<object>();

                do
                {
                    if (reader.ElementType == typeof(Cell))
                    {
                        Cell c = (Cell)reader.LoadCurrentElement();

                        if (c == null || c.DataType == null || !c.DataType.HasValue)
                            continue;

                        object value;

                        switch (c.DataType.Value)
                        {
                            case CellValues.Boolean:
                                value = bool.Parse(c.CellValue.InnerText);
                                break;
                            case CellValues.Date:
                                value = DateTime.Parse(c.CellValue.InnerText);
                                break;
                            case CellValues.Number:
                                value = double.Parse(c.CellValue.InnerText);
                                break;
                            case CellValues.InlineString:
                            case CellValues.String:
                                value = c.CellValue.InnerText;
                                break;
                            case CellValues.SharedString:
                                value = sharedStringCache[int.Parse(c.CellValue.InnerText)];
                                break;
                            default:
                                continue;
                        }

                        if (value != null)
                            cells.Add(value);
                    }
                } while (reader.ReadNextSibling());

                if (cells.Any())
                    rows.Add(cells);
            }
        }
    }

    return rows;
}

I ran the program on a three-year-old laptop with an SSD drive, 8GB of RAM and an Intel Core i7-4710 CPU @ 2.50GHz (two cores) on Windows 10 64-bit.

Note that although opening and parsing the whole file as strings takes a bit less than 30 seconds, when using typed objects as in the example above, the time goes up to almost 50 seconds on my crappy laptop. You will probably get closer to 30 seconds on your server with Linux.

The trick was to use the SAX approach as explained here:

https://msdn.microsoft.com/en-us/library/office/gg575571.aspx

Answer 3

Well, if your Excel file is going to be as simple as a CSV, like your example (https://drive.google.com/file/d/0B_CXvCTOo7_2UVZxbnpRaEVnaFk/view?usp=sharing), you can try to open the file as a zip archive and read every XML directly:
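To see what this approach is exploiting: an .xlsx file is a plain zip archive whose parts are XML, so the standard library alone can get at them. A quick sketch in Python; the throwaway in-memory archive below is a stand-in for a real workbook, with drastically simplified XML.

```python
# An .xlsx file is a plain zip archive, so its XML parts can be read
# directly with the standard library. For a self-contained demo we build
# a throwaway archive in memory; with a real file you would pass its
# path to zipfile.ZipFile instead.
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("xl/sharedStrings.xml", "<sst><si><t>hello</t></si></sst>")
    z.writestr("xl/worksheets/sheet1.xml", "<worksheet><sheetData/></worksheet>")

with zipfile.ZipFile(buf) as z:
    names = z.namelist()
    shared = z.read("xl/sharedStrings.xml").decode("utf-8")

print(names)   # ['xl/sharedStrings.xml', 'xl/worksheets/sheet1.xml']
print(shared)  # <sst><si><t>hello</t></si></sst>
```

In a real workbook the shared-strings part holds every distinct text value once, and the sheet XML refers to them by index, which is exactly what the code below parses.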

Opening and reading the 200MB example file takes about 35 seconds with an HDD; with an SSD it takes a little less (30 seconds).

Here is the code: https://github.com/csaki/OpenSimpleExcelFast.git

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.ZipFile;

public class Launcher {

    public static final char CHAR_END = (char) -1;

    public static void main(String[] args) throws IOException, ExecutionException, InterruptedException {
        long init = System.currentTimeMillis();
        String excelFile = "D:/Downloads/BigSpreadsheet.xlsx";
        ZipFile zipFile = new ZipFile(excelFile);

        ExecutorService executor = Executors.newFixedThreadPool(4);
        Future<String[]> futureWords = executor.submit(() -> processSharedStrings(zipFile));
        Future<Object[][]> futureSheet1 = executor.submit(() -> processSheet1(zipFile));
        String[] words = futureWords.get();
        Object[][] sheet1 = futureSheet1.get();
        executor.shutdown();

        long end = System.currentTimeMillis();
        System.out.println("Main only open and read: " + (end - init) / 1000);

        // Doing something with the file: saving as CSV
        init = System.currentTimeMillis();
        try (PrintWriter writer = new PrintWriter(excelFile + ".csv", "UTF-8")) {
            for (Object[] rows : sheet1) {
                for (Object cell : rows) {
                    if (cell != null) {
                        if (cell instanceof Integer) {
                            writer.append(words[(Integer) cell]);
                        } else if (cell instanceof String) {
                            writer.append(toDate(Double.parseDouble(cell.toString())));
                        } else {
                            writer.append(cell.toString()); // Probably a number
                        }
                    }
                    writer.append(";");
                }
                writer.append("\n");
            }
        }
        end = System.currentTimeMillis();
        System.out.println("Main saving to csv: " + (end - init) / 1000);
    }

    private static final DateTimeFormatter formatter = DateTimeFormatter.ISO_DATE_TIME;
    private static final LocalDateTime INIT_DATE = LocalDateTime.parse("1900-01-01T00:00:00+00:00", formatter).plusDays(-2);

    // Excel serial numbers count from 1900-Jan-1, so every number you get has to be added to that date
    public static String toDate(double s) {
        return formatter.format(INIT_DATE.plusSeconds((long) (s * 24 * 3600)));
    }

    public static Object[][] processSheet1(ZipFile zipFile) throws IOException {
        String entry = "xl/worksheets/sheet1.xml";
        Object[][] result = null;
        char[] dimensionToken = "dimension ref=\"".toCharArray();
        char[] tokenOpenC = "<c r=\"".toCharArray();
        char[] tokenOpenV = "<v>".toCharArray();
        char[] tokenAttributS = " s=\"".toCharArray();
        char[] tokenAttributT = " t=\"".toCharArray();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) {
            String dimension = extractNextValue(br, dimensionToken, '"');
            int[] sizes = extractSizeFromDimention(dimension.split(":")[1]);
            br.skip(30); // Between dimension and the next c tag there are more or less 30 chars
            result = new Object[sizes[0]][sizes[1]];
            String v;
            while ((v = extractNextValue(br, tokenOpenC, '"')) != null) {
                int[] indexes = extractSizeFromDimention(v);

                int s = foundNextTokens(br, '>', tokenAttributS, tokenAttributT);
                char type = 's'; // 3 types: number (n), string (s) and date (d)
                if (s == 0) { // Token S = number or date
                    char read = (char) br.read();
                    if (read == '1') {
                        type = 'n';
                    } else {
                        type = 'd';
                    }
                } else if (s == -1) {
                    type = 'n';
                }
                String c = extractNextValue(br, tokenOpenV, '<');
                Object value = null;
                switch (type) {
                    case 'n':
                        value = Double.parseDouble(c);
                        break;
                    case 's':
                        value = Integer.parseInt(c);
                        break;
                    case 'd':
                        value = c.toString();
                        break;
                }
                result[indexes[0] - 1][indexes[1] - 1] = value;
                br.skip(7); // </v></c>
            }
        }
        return result;
    }

    public static int[] extractSizeFromDimention(String dimention) {
        StringBuilder sb = new StringBuilder();
        int columns = 0;
        int rows = 0;
        for (char c : dimention.toCharArray()) {
            if (columns == 0) {
                if (Character.isDigit(c)) {
                    columns = convertExcelIndex(sb.toString());
                    sb = new StringBuilder();
                }
            }
            sb.append(c);
        }
        rows = Integer.parseInt(sb.toString());
        return new int[]{rows, columns};
    }

    public static String[] processSharedStrings(ZipFile zipFile) throws IOException {
        String entry = "xl/sharedStrings.xml";
        String[] words = null;
        char[] wordCount = "Count=\"".toCharArray();
        char[] token = "<t>".toCharArray();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) {
            String uniqueCount = extractNextValue(br, wordCount, '"');
            words = new String[Integer.parseInt(uniqueCount)];
            String nextWord;
            int currentIndex = 0;
            while ((nextWord = extractNextValue(br, token, '<')) != null) {
                words[currentIndex++] = nextWord;
                br.skip(11); // you can skip at least 11 chars: "/t></si><si>"
            }
        }
        return words;
    }

    public static int foundNextTokens(BufferedReader br, char until, char[]... tokens) throws IOException {
        char character;
        int[] indexes = new int[tokens.length];
        while ((character = (char) br.read()) != CHAR_END) {
            if (character == until) {
                break;
            }
            for (int i = 0; i < indexes.length; i++) {
                if (tokens[i][indexes[i]] == character) {
                    indexes[i]++;
                    if (indexes[i] == tokens[i].length) {
                        return i;
                    }
                } else {
                    indexes[i] = 0;
                }
            }
        }
        return -1;
    }

    public static String extractNextValue(BufferedReader br, char[] token, char until) throws IOException {
        char character;
        StringBuilder sb = new StringBuilder();
        int index = 0;
        while ((character = (char) br.read()) != CHAR_END) {
            if (index == token.length) {
                if (character == until) {
                    return sb.toString();
                } else {
                    sb.append(character);
                }
            } else {
                if (token[index] == character) {
                    index++;
                } else {
                    index = 0;
                }
            }
        }
        return null;
    }

    public static int convertExcelIndex(String index) {
        int result = 0;
        for (char c : index.toCharArray()) {
            result = result * 26 + ((int) c - (int) 'A' + 1);
        }
        return result;
    }
}
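The toDate helper above has to cope with Excel's serial-date scheme. In Python the equivalent conversion is usually written by offsetting from 1899-12-30, which absorbs Excel's fictitious 1900-02-29 for dates after February 1900; a sketch:

```python
# Excel stores dates as day counts from its 1900 epoch. Offsetting from
# 1899-12-30 compensates for Excel treating 1900 as a leap year, so the
# conversion is exact for any serial after that bug date.
from datetime import datetime, timedelta

def excel_serial_to_datetime(serial):
    # Fractional part of the serial encodes the time of day.
    return datetime(1899, 12, 30) + timedelta(days=serial)

print(excel_serial_to_datetime(43101))    # 2018-01-01 00:00:00
print(excel_serial_to_datetime(43101.5))  # 2018-01-01 12:00:00
```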

Answer 4

Python's Pandas library could be used to hold and process your data, but using it to directly load the .xlsx file will be quite slow, e.g. using read_excel().

One approach would be to use Python to automate the conversion of your file into CSV using Excel itself and to then use Pandas to load the resulting CSV file using read_csv(). This will give you a good speed up, but not under 30 seconds:

import win32com.client as win32
import pandas as pd
from datetime import datetime

print("Starting")
start = datetime.now()

# Use Excel to load the xlsx file and save it in csv format
excel = win32.gencache.EnsureDispatch('Excel.Application')
wb = excel.Workbooks.Open(r'c:\full path\BigSpreadsheet.xlsx')
excel.DisplayAlerts = False
wb.DoNotPromptForConvert = True
wb.CheckCompatibility = False

print('Saving')
wb.SaveAs(r'c:\full path\temp.csv', FileFormat=6, ConflictResolution=2)
excel.Application.Quit()

# Use Pandas to load the resulting CSV file
print('Loading CSV')
df = pd.read_csv(r'c:\full path\temp.csv', dtype=str)

print(df.shape)
print("Done", datetime.now() - start)

Column types
The types for your columns can be specified by passing dtype, converters and parse_dates:

df = pd.read_csv(r'c:\full path\temp.csv', dtype=str, converters={10:int}, parse_dates=[8], infer_datetime_format=True) 

You should also specify infer_datetime_format=True, as this will greatly speed up the date conversion.

infer_datetime_format : boolean, default False

If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.

Also add dayfirst=True if dates are in the form DD/MM/YYYY.
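As a quick illustration of what dayfirst changes, here is a sketch with a small in-memory CSV (the sample data is made up):

```python
# With dayfirst=True, "02/01/2017" is parsed as 2 January 2017 rather
# than 1 February 2017. The CSV here is a made-up stand-in for the
# temp.csv produced above.
import io
import pandas as pd

csv_data = "name,when,value\nfoo,02/01/2017,3\n"
df = pd.read_csv(io.StringIO(csv_data), parse_dates=["when"], dayfirst=True)
print(df["when"][0])  # 2017-01-02 00:00:00
```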

Selective columns
If you actually only need to work on columns 1, 9 and 11, then you could further reduce resources by specifying usecols=[0, 8, 10] as follows:

df = pd.read_csv(r'c:\full path\temp.csv', dtype=str, converters={10:int}, parse_dates=[1], dayfirst=True, infer_datetime_format=True, usecols=[0, 8, 10]) 

The resulting dataframe would then only contain those 3 columns of data.

RAM drive
Using a RAM drive to store the temporary CSV file would further speed up the load time.

Note: This does assume you are using a Windows PC with Excel available.

Answer 5

It looks like this is hardly achievable in Python at all. Even if we unpack the sheet data file first, merely passing it through a C-based iterative SAX parser (using lxml, a very fast wrapper over libxml2) takes nearly the whole 30-second budget:

from __future__ import print_function

from lxml import etree
import time

start_ts = time.time()

for data in etree.iterparse(open('xl/worksheets/sheet1.xml'), events=('start',),
                            collect_ids=False, resolve_entities=False,
                            huge_tree=True):
    pass

print(time.time() - start_ts)

The sample output: 27.2134890556
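If lxml is not available, a similar streaming pass can be done with the standard library's iterparse, and the sheet entry can even be read straight out of the .xlsx zip without unpacking it first. A sketch, where the tiny in-memory archive stands in for a real workbook (timings will of course differ):

```python
# Stream the sheet XML directly from the zip archive with the stdlib.
# The minimal archive built here is an assumption standing in for a
# real .xlsx file; with a real file, pass its path to zipfile.ZipFile.
import io
import zipfile
import xml.etree.ElementTree as ET

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("xl/worksheets/sheet1.xml",
               "<worksheet><sheetData>"
               "<row r='1'><c r='A1'><v>1</v></c></row>"
               "<row r='2'><c r='A2'><v>2</v></c></row>"
               "</sheetData></worksheet>")

with zipfile.ZipFile(buf) as z:
    with z.open("xl/worksheets/sheet1.xml") as sheet:
        # iterparse fires an "end" event per element as it streams,
        # so the whole tree never has to be held in memory at once.
        values = [elem.text for _, elem in ET.iterparse(sheet)
                  if elem.tag == "v"]

print(values)  # ['1', '2']
```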

By the way, Excel itself needs about 40 seconds to load the workbook.

Answer 6

I'm using a Dell Precision T1700 workstation, and with C# I was able to open the file and read its contents in about 24 seconds just using standard code to open a workbook via interop services. Using a reference to the Microsoft Excel 15.0 Object Library, here is my code.

My using statements:

using System.Runtime.InteropServices;
using Excel = Microsoft.Office.Interop.Excel;

Code to open and read workbook:

public partial class MainWindow : Window
{
    public MainWindow()
    {
        InitializeComponent();

        Excel.Application xlApp;
        Excel.Workbook wb;
        Excel.Worksheet ws;

        xlApp = new Excel.Application();
        xlApp.Visible = false;
        xlApp.ScreenUpdating = false;

        wb = xlApp.Workbooks.Open(@"Desired Path of workbook\Copy of BigSpreadsheet.xlsx");

        ws = wb.Sheets["Sheet1"];

        //string rng = ws.get_Range("A1").Value;
        MessageBox.Show(ws.get_Range("A1").Value);

        Marshal.FinalReleaseComObject(ws);

        wb.Close();
        Marshal.FinalReleaseComObject(wb);

        xlApp.Quit();
        Marshal.FinalReleaseComObject(xlApp);

        GC.Collect();
        GC.WaitForPendingFinalizers();
    }
}

Answer 7

The C# and OLE solution still has some bottleneck, so I tested it with C++ and ADO.

_bstr_t connStr(makeConnStr(excelFile, header).c_str());

TESTHR(pRec.CreateInstance(__uuidof(Recordset)));
TESTHR(pRec->Open(sqlSelectSheet(connStr, sheetIndex).c_str(), connStr, adOpenStatic, adLockOptimistic, adCmdText));

while (!pRec->adoEOF)
{
    for (long i = 0; i < pRec->Fields->GetCount(); ++i)
    {
        _variant_t v = pRec->Fields->GetItem(i)->Value;
        if (v.vt == VT_R8)
            num[i] = v.dblVal;
        if (v.vt == VT_BSTR)
            str[i] = v.bstrVal;
        ++cellCount;
    }
    pRec->MoveNext();
}

On an i5-4460 machine with an HDD, I find that reading 500 thousand cells from an .xls file takes 1.5s, but the same data in .xlsx takes 2.829s, so processing your data in under 30s should be possible.

If you really need under 30s, use a RAM drive to reduce file IO; it will significantly speed up your process. I cannot download your data to test it, so please tell me the result.

Answer 8

I have created a sample Java program which is able to load the file in ~40 seconds on my laptop (Intel i7, 4 cores, 16 GB RAM).

https://github.com/skadyan/largefile

This program uses the Apache POI library to load the .xlsx file using the XSSF SAX API.

The callback interface com.stackoverlfow.largefile.RecordHandler implementation can be used to process the data loaded from the Excel file. This interface defines only one method, which takes three arguments:

  • sheetname: String, the Excel sheet name
  • row number: int, the row number of the data
  • data map: Map of Excel cell reference to Excel formatted cell value

The class com.stackoverlfow.largefile.Main demonstrates one basic implementation of this interface, which just prints the row number on the console.

Update

The Woodstox parser seems to have better performance than the standard SAXReader (code updated in the repo).

Also, in order to meet the desired performance requirement, you may consider re-implementing org.apache.poi...XSSFSheetXMLHandler. In your implementation, more optimized string/text value handling can be used, and unnecessary text formatting operations may be skipped.

Answer 9

I would like to have more info about the system where you are opening the file... anyway:

look in your system for a Windows update called
"Office File Validation Add-In for Office ..."

if you have it... uninstall it...
the file should load much more quickly,
especially if it is loaded from a share

Answer 10

Another way that should largely improve the load/operation time is a RAM drive

create a RAM drive with enough space for your file plus 10%..20% extra...
copy the file to the RAM drive...
load the file from there... depending on your drive and filesystem, the speed improvement should be huge...

My favourite is the IMDisk toolkit
(https://sourceforge.net/projects/imdisk-toolkit/); here you have a powerful command line to script everything...

I also recommend SoftPerfect RAM Disk
(http://www.majorgeeks.com/files/details/softperfect_ram_disk.html)

but that also depends on your OS...

Answer 11

Save your Excel sheet to a tab-delimited file and open it as you'd normally read a plain txt :)

edit: You can then read the file line by line and split the lines at tabs. Get the data columns you need by index.
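A minimal sketch of that line-by-line approach, using the csv module with a tab delimiter so quoted fields are handled too (the sample data is made up):

```python
# Read a tab-delimited export line by line and pick columns by index.
# csv.reader with delimiter="\t" also copes with quoted fields that a
# naive str.split("\t") would mangle. The sample data is made up.
import csv
import io

tsv_data = "name\tdate\tamount\nfoo\t2017-01-02\t3.14\nbar\t2017-01-03\t2.71\n"

rows = list(csv.reader(io.StringIO(tsv_data), delimiter="\t"))
header, data = rows[0], rows[1:]

# Grab only the columns you need, e.g. name (0) and amount (2):
picked = [(r[0], float(r[2])) for r in data]
print(picked)  # [('foo', 3.14), ('bar', 2.71)]
```

With a real export you would replace io.StringIO(tsv_data) with open("file.txt", newline="").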

Answer 12

Have you tried loading the worksheet on demand, which is available since version 0.7.1 of xlrd?

To do this you need to pass on_demand=True to open_workbook().

xlrd.open_workbook(filename=None, logfile=<_io.TextIOWrapper name='' mode='w' encoding='UTF-8'>, verbosity=0, use_mmap=1, file_contents=None, encoding_override=None, formatting_info=False, on_demand=False, ragged_rows=False)


Other potential python solutions I found for reading an xlsx file:

  • Read the raw xml in 'xl/sharedStrings.xml' and 'xl/worksheets/sheet1.xml'
  • Try the openpyxl library's Read Only mode, which claims to be optimized in memory usage for large files.

    from openpyxl import load_workbook

    wb = load_workbook(filename='large_file.xlsx', read_only=True)
    ws = wb['big_data']

    for row in ws.rows:
        for cell in row:
            print(cell.value)
  • If you are running on Windows you could use PyWin32 and 'Excel.Application'

    import time
    import win32com.client as win32

    def excel():
        xl = win32.gencache.EnsureDispatch('Excel.Application')
        ss = xl.Workbooks.Add()
    ...