Tuesday, April 24, 2018

Does the performance of numpy differ depending on the operating system?


While reading the interesting book "From Python to Numpy", I came across an example described as follows:

Let's consider a simple example where we want to clear all the values from an array which has the dtype np.float32. How does one write it to maximize speed?

The results provided there surprised me, and when I rechecked them I got completely different behavior. So I asked the author to double-check, but he got the same results as before (on OSX 10.13.3) - see the table below:

The variants were timed on three different computers: two of mine (Windows 10 and Windows 7) and the author's (OSX 10.13.3), all with Python 3.6.4 and numpy 1.14.2. Each variant was timed for a fixed 100 loops, best of 3.

Edit: This question is not about the fact that I get different times on different computers with different specs - that is obvious :) The question is that the behavior differs so much between the two operating systems - which is not so obvious (and if it really does, I would be glad if someone could double-check).

The setup was: Z = np.ones(4*1000000, np.float32)
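
For reference, here is a minimal sketch of a timing harness along these lines, using the standard timeit module (the exact commands used to produce the numbers below may have differed):

```python
import timeit

# Each statement re-runs against the same setup: Z = np.ones(4*1000000, np.float32)
setup = "import numpy as np; Z = np.ones(4*1000000, np.float32)"

variants = [
    "Z.view(np.float64)[...] = 0",
    "Z.view(np.float32)[...] = 0",
    "Z.view(np.float16)[...] = 0",
    "Z.view(np.int8)[...] = 0",
    "Z.fill(0)",
    "Z[...] = 0",
]

for stmt in variants:
    # 100 loops per measurement, best of 3 repeats, matching the table below
    best = min(timeit.repeat(stmt, setup=setup, repeat=3, number=100))
    print(f"{stmt:30s} {best / 100 * 1e6:8.1f} usec per loop")
```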

| Variant                     | Windows 10 (computer 1) | Ubuntu 17.10 (computer 1) | Windows 7 (computer 2) | OSX 10.13.3 (computer 3) |
| --------------------------- | ----------------------- | ------------------------- | ---------------------- | ------------------------ |
| Z.view(np.float64)[...] = 0 | 758 usec                | 1.03 msec                 | 2.72 msec              | 1.01 msec                |
| Z.view(np.float32)[...] = 0 | 757 usec                | 1.01 msec                 | 2.61 msec              | 1.58 msec                |
| Z.view(np.float16)[...] = 0 | 760 usec                | 1.01 msec                 | 2.62 msec              | 2.85 msec                |
| Z.view(np.complex)[...] = 0 | 1.06 msec               | 1.02 msec                 | 3.26 msec              | 918 usec                 |
| Z.view(np.int64)[...] = 0   | 758 usec                | 1.03 msec                 | 2.69 msec              | 1 msec                   |
| Z.view(np.int32)[...] = 0   | 757 usec                | 1.01 msec                 | 2.62 msec              | 1.46 msec                |
| Z.view(np.int16)[...] = 0   | 760 usec                | 1.01 msec                 | 2.63 msec              | 2.87 msec                |
| Z.view(np.int8)[...] = 0    | 758 usec                | 773 usec                  | 2.68 msec              | 614 usec                 |
| Z.fill(0)                   | 747 usec                | 998 usec                  | 2.55 msec              | N/A                      |
| Z[...] = 0                  | 750 usec                | 1 msec                    | 2.59 msec              | N/A                      |

As you can see from this table, on Windows the results don't depend on the viewed type, but on OSX this hack strongly affects the performance. Can you provide any insight into why this happens?
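
For context, the "hack" relies on ndarray.view() reinterpreting the same underlying buffer under a different dtype without copying, so every variant writes to the same 16 MB of memory and only the element width used by the inner loop changes. A small illustration:

```python
import numpy as np

Z = np.ones(4 * 1000000, np.float32)   # 16 MB buffer

V = Z.view(np.int8)                    # same memory, viewed as bytes
print(V.base is Z)                     # True: no copy was made
print(Z.nbytes, V.nbytes)              # 16000000 16000000 - same buffer
print(Z.shape, V.shape)                # (4000000,) (16000000,) - more, smaller elements

V[...] = 0                             # zeroing the view zeroes Z as well
print(Z[:3])                           # [0. 0. 0.]
```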

Edit: As I wrote above, the three computers are different.

The specs of the first computer: Windows 10 and Ubuntu 17.10
CPU: Intel Xeon E5-1650 v4 3.60GHz
RAM: 128GB DDR4-2400

The specs of the second computer: Windows 7
CPU: Intel Pentium P6100 2.00GHz
RAM: 4GB DDR3-1333

The specs of the third computer: I don't have this information :)

Link to the issue

Edit 2: Added results for the first computer on Ubuntu 17.10.

1 Answer

Answer 1

Keep in mind that Python is a very high-level programming language, and Pandas is likewise a high-level framework.

What you're essentially given to work with is a high-level API for the many operations you can perform with the language, without needing to worry about the underlying implementation.

If you were to work with a lower-level API, then just to assign an array to a variable you'd have to allocate some memory, create a structure to hold your data, and link it together (probably using pointers to memory addresses). And that still doesn't touch the actual chip: there's virtual memory mapping happening between your API and the data that ends up in hardware. That complexity applies to basically everything you do with Python & Pandas.

Yet all you have to write is `arr = [1, 2, 3]`, and you never have to worry about any of it.
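
To make the contrast concrete, here is a rough sketch of what that lower-level bookkeeping looks like when done by hand through ctypes, next to the high-level one-liner:

```python
import ctypes

# "Low level": reserve a block of memory, describe its layout, fill it by hand
IntArray3 = ctypes.c_int32 * 3        # a C-style array type: three 32-bit ints
buf = IntArray3()                     # allocate the memory
for i, value in enumerate((1, 2, 3)):
    buf[i] = value                    # write each element into the buffer
print(list(buf), ctypes.sizeof(buf))  # [1, 2, 3] 12

# "High level": the same data, with allocation and layout handled for you
arr = [1, 2, 3]
```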

Now, Python is expected to work the same on every platform you run it on - at least in most cases.

Now that the boring introduction is behind us: the whole idea of "expose a uniform API, don't worry about the implementation" is widespread in computer programming. There are subtle implementation details that differ from one operating system to another, and they may or may not impact the performance of your software. I don't expect the effect to be dramatic, but it's there and worth mentioning.

For example, there's an old answer about np.dot performance differing between Linux and Windows. Its author has far more knowledge of this subject than I do, and points out that that particular function is a wrapper around CBLAS routines, which use the fastest routines available on a given platform.
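
If you want to see which BLAS/LAPACK implementation your NumPy build is linked against (and therefore which optimized routines np.dot ends up dispatching to on your platform), NumPy can report it directly:

```python
import numpy as np

# Show the BLAS/LAPACK libraries this NumPy build was compiled against
# (e.g. MKL, OpenBLAS, or the reference implementation) - this is what
# np.dot ultimately dispatches to, and it can differ between platforms.
np.show_config()

# A quick np.dot call that goes through those routines
a = np.random.rand(1000, 1000)
b = np.random.rand(1000, 1000)
c = np.dot(a, b)
print(c.shape)  # (1000, 1000)
```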

That being said, Pandas is a very complex library which aims to make data analysis as simple as possible by exposing a simple-to-use API to the programmer. I expect there are many more places where Pandas makes good use of the best mechanisms available on your platform to perform its tasks as fast as it can.
