

The blog post by Daniel Lemire that you quote uses a benchmark in a very specific context: parsing integers.

The problem with your benchmark is that there's a lot of noise from using std::vector/std::string: the data is copied around, there are memory allocations, etc. In order to verify the pure speed of your parser, I'd recommend instead using only stack variables in the verification layer:

- A counter of the number of rows.
- A counter of the number of columns (total across rows).
- Possibly some minimal "hash": hash ^= cell.empty() ? '0' : cell[0].

And then print those 3 numbers at the end. The printing time should be negligible compared to the time it takes to actually read/parse the CSV, the 3 distinct counters require actually doing the parsing - preventing the optimizer from removing useful work and invalidating the benchmark - and the counters can also serve as a crude sanity check to verify that the various parsers agree on the result of the parse.

As for being fast, I'd honestly use mmap, and go for 0-copy.

I wrote a csv library last year and it turned out like crap: it was poorly designed, became buggy, and was generally hard to maintain. I've used fast-cpp-csv-parser in the past. It's great (and fast!) but requires the user to know a lot at compile time, e.g., column_count, column_names, etc. I wanted to see what performance could be achieved by parsing single-threaded and managing internal objects with std::string_view.

It seems to be pretty hard to find benchmarks for (or comparisons between) existing CSV parsers in C++. Each CSV parser (including this one) provides a different interface to read files and access rows. Some CSV parsers are lenient w.r.t. RFC 4180 compliance while others perform strict checking, sometimes throwing exceptions on compliance failures. Some CSV parsers provide performance measurements on either programmatically generated data or simple (now unavailable) test CSV files. For this library, I decided to use publicly available datasets.

Specifically, I'd like to know if the performance measurements (see below) are competitive. Any tips on how to improve ifstream read speeds or tokenization would also be greatly appreciated. Following this blog post by Daniel Lemire, I haven't bothered with mmap.
