Insane Coding: Reading in an entire file at once in C++, part 2

Tuesday, November 29, 2011

Posted by insane coder at Tuesday, November 29, 2011

Last week I discussed six different methods for quickly getting an entire file into a C++ string.

The conclusion was that the straightforward method was the ideal method, which was verified across several compilers. Since then, people have asked me a number of questions.

What if I want to use a vector instead of a string? Are the speeds different?
Forget C++ containers, what about directly into a manually allocated buffer?
What about copying into a buffer via mmap?
What do these various cases say about compilers or their libraries? Can this indicate what they're good or bad at compiling?

So to establish some baseline numbers, I created an application which reads the same files the same number of times using the fastest method possible. It creates a buffer aligned to the partition's blocking factor, tells the OS to optimize for a sequential read, and then reads the file directly into the buffer. I also tested the suggested method of mmap'ing a file and copying the contents from there into an allocated buffer. The times achieved for these were 5 and 9 respectively. The latter is slower because of the extra memcpy() required; you want to read directly into your destination buffer whenever possible. The fastest time I achieved should be more or less the theoretical limit for my hardware. Both GCC and LLVM got the exact same times. I did not test with VC++ here as it doesn't support POSIX 2008.
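The test program itself isn't reproduced here, so what follows is only a minimal sketch of the two baselines as described, assuming a POSIX 2008 system. The names FileBuffer, open_and_stat, read_posix, and read_mmap_copy are hypothetical, and the 4096-byte fallback alignment is an assumption rather than something taken from the benchmark.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>
#include <stdexcept>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical owner of a malloc-family buffer holding a file's bytes.
struct FileBuffer {
    char* data = nullptr;
    std::size_t size = 0;
    FileBuffer() = default;
    FileBuffer(FileBuffer&& o) noexcept : data(o.data), size(o.size) { o.data = nullptr; o.size = 0; }
    FileBuffer(const FileBuffer&) = delete;
    FileBuffer& operator=(const FileBuffer&) = delete;
    ~FileBuffer() { std::free(data); }
};

// Opens the file read-only and fills in its size and preferred I/O block size.
static int open_and_stat(const char* path, struct stat& st) {
    const int fd = open(path, O_RDONLY);
    if (fd < 0) throw std::runtime_error("open failed");
    if (fstat(fd, &st) != 0) { close(fd); throw std::runtime_error("fstat failed"); }
    return fd;
}

// Baseline 1: read() straight into a block-aligned destination buffer.
FileBuffer read_posix(const char* path) {
    struct stat st;
    const int fd = open_and_stat(path, st);
    // Advisory hint that the read is sequential; failure is harmless, so it is ignored.
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    const std::size_t size = static_cast<std::size_t>(st.st_size);
    std::size_t block = st.st_blksize > 0 ? static_cast<std::size_t>(st.st_blksize) : 4096;
    // posix_memalign needs a power of two that is at least sizeof(void*).
    if (block < sizeof(void*) || (block & (block - 1)) != 0) block = 4096;
    // Capacity is a whole number of blocks and always at least one block.
    const std::size_t capacity = (size / block + 1) * block;
    assert(capacity >= size);

    FileBuffer buf;
    void* p = nullptr;
    if (posix_memalign(&p, block, capacity) != 0) { close(fd); throw std::bad_alloc(); }
    buf.data = static_cast<char*>(p);

    // read() may return fewer bytes than requested, so loop until the file is consumed.
    while (buf.size < size) {
        const ssize_t n = read(fd, buf.data + buf.size, size - buf.size);
        if (n < 0 && errno == EINTR) continue;
        if (n < 0) { close(fd); throw std::runtime_error("read failed"); }
        if (n == 0) break;  // file turned out shorter than fstat reported
        buf.size += static_cast<std::size_t>(n);
    }
    close(fd);
    return buf;
}

// Baseline 2: map the file, then memcpy() it into an allocated buffer.
FileBuffer read_mmap_copy(const char* path) {
    struct stat st;
    const int fd = open_and_stat(path, st);
    const std::size_t size = static_cast<std::size_t>(st.st_size);
    FileBuffer buf;
    if (size == 0) { close(fd); return buf; }  // mmap rejects zero-length mappings

    void* map = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (map == MAP_FAILED) throw std::runtime_error("mmap failed");
    (void)posix_madvise(map, size, POSIX_MADV_SEQUENTIAL);

    buf.data = static_cast<char*>(std::malloc(size));
    if (buf.data == nullptr) { munmap(map, size); throw std::bad_alloc(); }
    std::memcpy(buf.data, map, size);  // the extra copy that the direct read avoids
    buf.size = size;
    munmap(map, size);
    return buf;
}
```

Both functions return the same bytes; the only structural difference is that the mmap version touches the data twice, once when the pages are faulted in and once in memcpy().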

Now regarding our 6 methods from last time, all of them except the Rdbuf method are possible with std::vector; Rdbuf is not, since the standard library has no std::vector-based stream.
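As an illustration, the seek-to-end, size, and read approach can be adapted to a std::vector along these lines. This is a sketch only: read_file_to_vector is a hypothetical name, and it is not necessarily the exact code that was benchmarked.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <ios>
#include <stdexcept>
#include <string>
#include <vector>

// Reads a whole file into a vector by sizing the vector first and then
// issuing a single read() into its storage.
std::vector<char> read_file_to_vector(const std::string& path) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);

    // Seek to the end to learn the file size, then rewind.
    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();
    if (size < 0) throw std::runtime_error("cannot determine size of " + path);
    in.seekg(0, std::ios::beg);

    // Construct the vector at its final size so no reallocation happens.
    std::vector<char> contents(static_cast<std::size_t>(size));
    if (size > 0) {
        in.read(contents.data(), static_cast<std::streamsize>(size));
        if (!in) throw std::runtime_error("short read on " + path);
        assert(in.gcount() == size);
    }
    return contents;
}
```

The only change from a string-based version is the container type; contents.data() plays the role that &contents[0] plays for a string.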

An interesting thing to note is that C++ string implementations vary greatly. Some offer optimizations for very small strings, and some optimize frequent copies by using reference counting. Some always ensure the string ends with a 0 byte so you can immediately pass it as a C string. In this last case, operations which work on a string as a range are rather quick, as the data is copied and then a single 0 is appended, whereas a loop which constantly pushes bytes onto the back has to needlessly set the extra trailing byte to 0 each time. Vector implementations, on the other hand, don't need to worry about a trailing 0 byte, and generally don't use elaborate internal storage schemes. So if std::vector works for you, you may want to use that.
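To make the range-versus-push_back() distinction concrete, here is a sketch of how the three iterator-based methods could look with a std::vector<char>. The function names and the Bytes and InIt aliases are hypothetical, and the shapes are an assumption about how such methods are typically written, not the benchmarked code.

```cpp
#include <algorithm>
#include <cassert>
#include <istream>
#include <iterator>
#include <sstream>
#include <vector>

using Bytes = std::vector<char>;
using InIt = std::istreambuf_iterator<char>;

// Iterator: hand the whole range to the container's constructor.
Bytes read_iterator(std::istream& in) {
    return Bytes(InIt(in), InIt());
}

// Assign: hand the whole range to assign() on an existing container.
Bytes read_assign(std::istream& in) {
    Bytes contents;
    contents.assign(InIt(in), InIt());
    return contents;
}

// Copy: std::back_inserter calls push_back() once per byte.
Bytes read_copy(std::istream& in) {
    Bytes contents;
    std::copy(InIt(in), InIt(), std::back_inserter(contents));
    return contents;
}

// Usage sketch: all three methods should produce identical bytes.
void check_methods_agree() {
    std::istringstream a("abc"), b("abc"), c("abc");
    const Bytes expected = read_iterator(a);
    assert(expected == read_assign(b));
    assert(expected == read_copy(c));
}
```

With a string in place of the vector, the first two give the implementation a chance to write the terminating 0 once, while the third may rewrite it on every push_back().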

Let's review times for the 5 applicable methods with our various compilers.

GCC 4.6 with a vector:

Method	Duration
C	23.5
C++	22.8
Iterator	73
Assign	81.8
Copy	68

Whereas with a string:

Method	Duration
C	24.5
C++	24.5
Iterator	64.5
Assign	68
Copy	63

We see that with a vector, the basic methods became a bit faster, but interestingly enough, the others got slower. However, which methods are superior to the others has remained the same.

Now for LLVM 3 with a vector:

Method	Duration
C	8
C++	8
Iterator	860
Assign	1328
Copy	930

Versus with a string:

Method	Duration
C	7.5
C++	7.5
Iterator	110
Assign	102
Copy	97

With LLVM, everything is slower with a vector, and for the more complex solutions, much, much slower. There are two interesting things we can see about LLVM though. For more straightforward logic, its optimizations are extremely smart, and the speeds approach the theoretical best. I did some profiling on GCC and LLVM, as they're using the same C and C++ libraries, and found that in the straight C/C++ methods for my test program, GCC made 300 memory allocations, but LLVM made only 200. LLVM apparently is smart enough to see inside the various allocations, skip the ones that aren't needed, and place the data directly into the output buffer. But for complex code, LLVM's optimizations aren't that great, and in the case of vectors and iterators, they're downright awful. Someone should file some bug reports with them.
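The allocation counts above came from profiling, and that setup isn't shown. As a different and much cruder technique, allocations made by a single container can be counted with a custom allocator. The sketch below assumes nothing about the original profiler; CountingAllocator, g_allocations, and the two helper functions are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical global counter, reset by each helper below.
static std::size_t g_allocations = 0;

// Minimal allocator that forwards to std::allocator and counts allocate() calls.
template <class T>
struct CountingAllocator {
    using value_type = T;
    CountingAllocator() = default;
    template <class U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}
    T* allocate(std::size_t n) {
        ++g_allocations;
        return std::allocator<T>().allocate(n);
    }
    void deallocate(T* p, std::size_t n) noexcept {
        std::allocator<T>().deallocate(p, n);
    }
};

template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return false; }

using CountedBytes = std::vector<char, CountingAllocator<char>>;

// Sizing the vector up front, as the straightforward methods do.
std::size_t allocations_when_presized(std::size_t n) {
    g_allocations = 0;
    CountedBytes v(n);
    return g_allocations;
}

// Growing byte by byte, as a push_back()-based method does without reserve().
std::size_t allocations_when_pushing(std::size_t n) {
    g_allocations = 0;
    CountedBytes v;
    for (std::size_t i = 0; i < n; ++i) v.push_back('x');
    return g_allocations;
}
```

Comparing the two return values for the same n shows how many extra allocations the growth strategy costs on a given standard library; the exact numbers depend on the implementation's growth factor.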

Now for Visual C++ 2010 using vector:

Method	Duration
C	17.8
C++	18.7
Iterator	180.6
Assign	159.5
Copy	165.6

And string:

Method	Duration
C	16.5
C++	20.4
Iterator	224.4
Assign	222.8
Copy	320

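We see here that the Copy method, which uses push_back(), got a huge performance improvement with a vector. This seems to indicate that the string implementation writes a terminating 0 after each operation, push_back() included, instead of only when c_str() is called. In VC 2010 the Iterator, Assign, and C++ methods are also faster with a vector; in VC 2005 (listed in the final table), string is faster everywhere except Copy.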

It's also sad to see that GCC, while winning all the cases where iterators were involved, was significantly slower in all the straightforward cases. This seems to indicate that GCC has the smartest optimizations, but fails to optimize well when the logic is straightforward. Someone should look into that.

It seems that if you're trying to hold a collection of bytes (or whatever your wchar_t is) and don't care about the specialties of any particular container, string is faster, as long as you don't push_back() a lot.

Finally, here's a table of all the compilers and methods I tested, ordered by speed (s = string, v = vector):

Method	Duration
POSIX	5
LLVM 3.0 s C/C++	7.5
LLVM 3.0 v C/C++	8
MMAP	9
VC 2010 s C	16.5
VC 2010 v C	17.8
VC 2005 s C	18.3
VC 2010 v C++	19.7
VC 2010 s C++	20.4
VC 2005 s C++	21
GCC 4.6 v C++	22.8
GCC 4.6 v C	23.5
VC 2005 v C	24
GCC 4.6 s C/C++	24.5
VC 2005 v C++	26
LLVM 3.0 s Rdbuf	31.5
GCC 4.6 s Rdbuf	32.5
GCC 4.6 s Copy	63
GCC 4.6 s Iterator	64.5
GCC 4.6 s Assign	68
GCC 4.6 v Copy	68
GCC 4.6 v Iterator	73
GCC 4.6 v Assign	81.8
LLVM 3.0 s Copy	97
LLVM 3.0 s Assign	102
LLVM 3.0 s Iterator	110
VC 2010 v Assign	159.5
VC 2010 v Copy	165.6
VC 2005 v Copy	172
VC 2010 s Rdbuf	176.2
VC 2010 v Iterator	180.6
VC 2005 s Rdbuf	199
VC 2005 s Iterator	209.3
VC 2005 s Assign	221
VC 2010 s Assign	222.8
VC 2010 s Iterator	224.4
VC 2010 s Copy	320
VC 2005 v Iterator	370
VC 2005 v Assign	378
VC 2005 s Copy	483.5
LLVM 3.0 v Iterator	860
LLVM 3.0 v Copy	930
LLVM 3.0 v Assign	1328

Comments:

iCrazy said...: Excellent follow-up. Thanks for sharing!; December 19, 2011 at 9:29 PM
BrettB said...: Could you post a snippet of your POSIX code?; October 31, 2014 at 10:14 AM
insane coder said...: Hi BrettB,

I'm not sure I still even have the test code I wrote for this article.

However, the concept behind fast POSIX file to memory usage involves the following:
Direct file access functions (the classic open/read/close).
Best alignment of data in memory and multiple of the file's blocking factor (posix_memalign and fstat with st_blksize).
Informing the OS of I/O strategy (posix_fadvise and posix_madvise).

If you just want the entire file in a C++ string, a lot of the details above aren't relevant. But you'll still want open, posix_fadvise, read, and close to set up a fast, no-nonsense read with as little overhead as possible.; November 5, 2014 at 5:11 AM
snorlax said...: @insane coder (& BrettB)

I took a stab at the POSIX method; does this snippet look correct?

https://gist.github.com/rayhamel/1823976595c08d26e6576a36e4688e87; August 15, 2017 at 2:48 PM
insane coder said...: I haven't reviewed all of it, but the general gist of it looks correct.; April 29, 2018 at 11:47 AM