Pubby (pubby8@gmail.com) Aug 15 2016

Flat Containers the C++ Way

Introduction

Flat Containers (P0038R0) by Sean Middleditch gave a good overview of what flat containers are and how might they be designed. This paper aims to answer the questions posed in Sean's paper, focusing-in on a specific design for the standard.

Code

A coded implementation can be found at https://github.com/pubby/flat.

Cheatsheet

New headers

<flat_map>
<flat_set>

New types

std::flat_map
std::flat_multimap
std::flat_set
std::flat_multiset

std::vector_map
std::vector_multimap
std::vector_set
std::vector_multiset

New free functions

binary_find
- Returns iterator to element if found, otherwise returns last.
- Part of <algorithm>.

New member functions

has
- map.has(key) returns a pointer to key's mapped value if it exists, otherwise returns null.
insert (delayed sort)
- map.insert(first, last, std::delayed_sort) performs insertion with a delayed sort optimization.
Constructors (delayed sort overload)
- Constructs flat container using a delayed sort optimization.
Constructors (container construct overload)
- Constructs flat container by forwarding arguments to the underlying container member.

New public data members

container
- The flat container's underlying container.
iterator::underlying
- The iterator's underlying container iterator.

Flat containers in 3 examples

1. Flat containers are container adaptors

Any vector-like sequence container can be used by flat containers:

std::flat_set<std::vector<int>> vecset;
std::flat_map<std::vector<std::pair<int, double>>> vecmap;
std::flat_set<std::deque<char>> deqset;

// These next containers use the stack to allocate - Neat!

std::flat_set<boost::container::static_vector<double, 64>> staticset;
std::flat_set<boost::container::small_vector<std::string, 32>> smallset;

// Custom containers work great too.

std::flat_set<GameCompany::GameVector<GameInt>> gameSet;

For convenience, the typedefs vector_set and vector_map are provided as aliases for std::vector flat containers.

std::vector_set<int> vecset;
std::vector_map<int, double> vecmap;

Use these typedefs all the time.

2. Flat containers use the same interface as `map` and `set`

There's nothing to learn because you already know how to use flat containers.

std::vector_map<std::string, int> map;

map.insert(std::make_pair("Hello World", 42));
map.insert(std::make_pair("insert is great", 10));

map.erase("insert is great");

map.emplace("emplace is better", 11);

for(auto const& pair : map)
    std::cout << pair.first << ": " << pair.second << std::endl;

map.clear();

3. Flat containers expose their underlying container with a public data member

Use container to gain access to the underlying container. This works because flat containers are just light wrappers around their contained containers.

std::vector_set<int> set;

// vector_set does not have a reserve member, but that's not a problem with 'container'.

set.container.reserve(1000);

for(int i = 0; i != 1000; ++i)
    set.insert(i + i/2);

// operator[] on 'container' is convenient to have.

for(int i = 0; i != set.container.size(); ++i)
    set.container[i] *= 2;

There are rules regarding container, but we'll get into those later.

Design considerations

Do what programmers expect.
- Use the same interface as standard map/set if possible.
- Define operations in terms of existing standard functions.
Be worth using over plain vectors.
- There should be little reason to use plain sorted vectors as maps when flat_map exists.
Be easy to migrate to.
- Lots of code out there uses Boost/Loki/EASTL flat containers already.
- An improved design gives people reason to migrate.
Support embedded and memory-constrained programmers.
- Stack-allocated vector implementations are key.

Performance considerations

There are three cases where flat containers beat out everything else, bar none:

In-order iteration.
Whole-container copying.
Memory usage.

Note that "searching" is not include on the list. In general, the O(1) average-case performance of hash tables will always beat flat containers when it comes to searching, insertion, and removal. Flat containers do have their advantages over hash tables when it comes to worst-case performance, but rarely are they optimal.

The graph below shows the performance of associate containers at sizes less than 256. Note that hash tables perform the best and linear search (displayed as 'unsorted') performs the worst. The two flat containers are displayed in black and cyan.

(Side note: Linear search is really, really terrible and should stop being recommended to newbies as a faster alternative to associative containers. It's not a worthwhile optimization to make; it's just stupid.)

Internal algorithms

Array Layouts for Comparison-Based Searching

"After extensive testing and tuning on a wide variety of modern hardware, we arrive at the conclusion that, for small values of n, sorted order, combined with a good implementation of binary search is best. For larger values of n, we arrive at the surprising conclusion that the Eytzinger [breadth-first] layout is usually the fastest."

That quotation is from the abstract of an absolutely fantastic paper by Pat Morin and Paul-Virak Khuong called Array Layouts for Comparison-Based Searching, which details the different representations of flat sorted arrays and their influence upon performance. The paper is based on a series of binary search benchmarks written by Pat and ran by others on various types of hardware. Graphs made from these benchmarks are availible on Pat Morin's website.

Here is one such graph which shows the performance of 32-bit data on an Intel i7-4790K:

As you can see, branch-free binary search performs best for sizes that fit in the cache, but Eytzinger (breadth-first) is the clear winner past that.

A brief explanation of these results

It's all due to bandwidth, prefetching, and speculative loads. It is not due to "having fewer cache misses".

Two key properties to take note of:

The Eytzinger layout causes child and grandchild (and great-grandchild, etc) indices to share the same cacheline.
On modern CPUs, requesting up to 4 cachelines of memory can have the same latency as requesting just one. Multiple requests can be outgoing at once.

The Eytzinger search algorithm is fast because it loads memory it might need several iterations in advance. Property #1 allows this prefetching to be efficient; all possible grandchildren of a given node exist in just one or two cachelines. Property #2 provides the speedup; a small number of prefetches (four) can be outgoing at once without increases in latency.

A detailed explanation of these results

Read the paper!

Rejecting Eytzinger

Unfortunately, the Eytzinger layout is not useful as a general-purpose container. Despite being fast at searches, Eytzinger's other map-like operations slow and complicated.

Full implementations of the Eytzinger layout do not exist, but performance numbers can be estimated from a benchmark done by Joaquín M López Muñoz. This benchmark compares the in-order traversal time of Eytzinger layouts (referred to as level-order) to sorted vectors. The Eytzinger layout took five times as long when doing this traversal.

Assumptions can be made about the speed of insertion, removal, and in-order construction based on these results. Since Eytzinger layouts are perfectly balanced binary trees, such operations must do additional work to re-balance the tree. This requires at least O(N) complexity for traversing the tree, and that implies a constant factor which is at least five times that of sorted vectors.

Sorted vectors are strongly preferred from a standard library perspective. They offer good - but not optimal - performance accross all of their operations, and can be implemented using pre-existing library functions. Sorted vectors are the best choice for this proposal.

One Vector or Two?

There are two ways of storing sorted key/value pairs:

Interleaved into one vector.
Separated into two vectors.

The graphs below show the performance of searching maps with these two representations.

8 byte keys / 8 byte mapped values:

8 byte keys / 16 byte mapped values:

8 byte keys / 256 byte mapped values:

(The y-axis scale of these graphs is linear and starts at time=0)

When the size of the key is equal to the size of the mapped value, both representations will always have the same performance. However, as the size of the mapped value grows, performance shifts to favoring separate vectors instead of single interleaved ones.

In reality, large sized mapped values are a rare sight. It is often more desirable for the map to hold pointers that point to large objects than it is for the map to hold large objects directly. This enables the container to have fast insertion and removal without problems of reference invalidation.

Interface

The interleaved reprentation uses std::pair value types and is thus is strongly preferred from a design perspective. Interleaved vectors of pairs mesh well with the existing standard library and containers. In addition, the container adaptor design of flat containers strongly favors single containers.

Thus, the proposed flat container design uses single, interleaved vectors of pairs.

Sortedness

Take a look at the code below. Will it compile? What will it print when it runs?

boost::container::flat_map<int, int> map;

map.insert(std::make_pair(0, 0));
map.insert(std::make_pair(1, 0));

map.begin()->first = 2;

std::cout << map.count(0) << ',' << map.count(1) << ',' << map.count(2);

This turns out to be undefined behavior. The most likely result is:

0,0,0

Boost's flat_map implementation uses a single sorted vector of pairs. Unlike std::map, the keys of flat_map's pairs are not const; they can be modified. Modifying the keys causes the vector to become unsorted, and an unsorted vector causes flat_map to break.

This problem is not unique to Boost. Loki, EASTL, and every other popular implementation have this problem too.

The search for a solution

There are possibilities to design flat container classes which are impossible to make unsorted. All such designs are based around keeping key values immutable.

Idea: Reference Proxies.
- Useful, but full of gotchas. Reference proxies break the property of, "Do what people expect."
- Not everyone considers them "zero overhead".
- Nobody wants another vector<bool>.
Idea: Abandon std::pair iterators.
- Breaks compatibility with standard algorithms and containers.
- Interface becomes awkward and verbose in certain places.
Idea: Const iterators by default.
- See below.

Const iterators: Safety by default

All iterators of flat containers can be made const iterators by default. This means that references to const pairs are returned when dereferencing flat container iterators.

fc::vector_map<int, int> map;
*map.first() = std::make_pair(1, 2); // doesn't compile
map.first()->first = 3; // doesn't compile
map.first()->second = 4; // doesn't compile

There are two ways of modifying mapped values.

First, one can use the has function. has works just like at and find, but returns a pointer value instead of an iterator.

if(int* mapped_value = map.has(4))
    *mapped_value = "has is handy";

Whenever possible, use has.

Second, one can use an iterator public data member called underlying. underlying is the iterator equivalent of container and grants access to iterators belonging to the adapted container. These 'underlying' container iterators have their usual const semantics and can be used to modify both keys and values.

map.first().underlying->second = "underlying is useful";

Like container and const_cast, use of underlying is an explicit declaration to others that you know what you're doing and you can verify that it is safe. This explicitness is what makes underlying the safer alternative to Boost-style iterators.

Are unsortedness states beneficial?

Yes. Going all-out to prevent unsorted states leads to verbose, hard to use container designs that are closed off from user optimizations.

It more practical to be "safe by default" than "safe all the time".

Guidelines for coders

There are only three things that can lead to unsorted states:

container
underlying
Throwing moves.

Avoid all three and flat containers are guaranteed to remain sorted.

Always remember that std::sort followed by std::unique can be used to turn any unsorted state into a sorted state.

Sorted Preconditions

Permitting unsorted states in flat containers requires the standard to address and define this behavior.

Affected member functions gain preconditions based on the preconditions of lower_bound, upper_bound, and equal_range.
- Requires: The elements e of [begin, end) shall be partitioned with respect to the expression key_comp(e, value).
Unsorted/unpartitioned flat containers are well-defined for all operations that do not have such preconditions.

Uniqueness

The rules regarding duplicate keys are less restrictive than the rules regarding sortedness.

The intention of flat_map and flat_set is to hold unique keys, but this is not a hard requirement of the standard. The behavior of member functions on duplicate-key containers is well defined but implementation specific. For example, using find on a container with duplicate keys will return an iterator to any one of the matching values. This matches the preexisting behavior of std::multimap.

Container Adaptors

The design of priority_queue has influenced the design of flat containers and the similarities between the two are apparent.

std::priority_queue

A light wrapper around a container. Default: deque.
Operations are defined in terms of the standard functions make_heap, push_heap, and pop_heap.

std::flat_map

A light wrapper around a container. Default: vector.
Operations are defined in terms of the standard functions lower_bound, upper_bound, and equal_range.

The ability to use static_vector and small_vector with flat containers is a major selling point of this proposal. The container adaptor interface does not add complexity or performance penalties to the library's implementation.

User-defined reference proxies

Users may want to use their own reference proxy containers with flat containers. For example, consider a container which attempts to implement the "two separate vectors" idea discussed in the Internal algorithms section:

template<typename K, typename T>
struct two_vectors
{
    class iterator { /* ... Reference proxy implementation ... */ };
    /* ... Sequence container member functions ...  */
    std::vector<K>; keys;
    std::vector<T>; mapped;
};

std::flat_map<int, double, two_vectors<int, double>> map;

two_vectors does not store pair value types but it can be faked to do so with reference proxies. Note that reference proxies can apply const to their references to prevent users from modifying keys.

If possible, reference proxy containers should be permitted by the standard for use in flat containers.

Packed

Users may want to use custom pair types. For example, consider this packed_pair:

template<typename A, typename B>
struct packed_pair
{
    using first_type = A;
    using second_type = B;
    A first;
    B second;
} __attribute__((packed));

This type can be used as follows:

std::flat_map<char, int, std::vector<packed_pair<char, int>>> map;

If possible, custom value types should be permitted by the standard for use in flat containers.

Public `container` data member

The standard library's current container adaptors are not well-loved, and this is largely due to them having less functionality than the containers that they adapt. The fact is, there are very few cases where stack and queue are the ideal container choice. Far more often it makes sense to use vector and deque directly and gain access to all the extra functionality they provide. The current container adaptors are practically useless.

With a desire to make flat containers useful, the natural solution is to make their adapted container member publicly visible. This comes at no loss to safety; unsortedness would exist in the design regardless of whether or not container was public. There are no synchronization issues to worry about as container can be the one and only data member of the flat container classes. And by making container a public data member, several advantages are to be had.

Case 1: Delayed sorting

Associative containers often follow this access pattern:

Store a bunch of values.
Wait.
Retreive a bunch of values.

An efficient way of storing multiple values into a flat container is to insert the values at the end and sort the container afterwards. For cases involving a single range, the flat container's insert function can be implemented to have this behavior:

std::flat_multiset<int> set;
set.insert(some_range.begin(), some_range.end(), std::delay_sort);

// Note: The constructor has overloads too.
// The code above could be rewritten like this:

std::flat_multiset<int> set2(some_range.begin(), some_range.end(), std::delay_sort);

Unfortunately, not all real-world situations have such clean-cut insertion ranges, and is it not always desirable for the container to sort itself after every insertion. For cases such as these, the public container data member offers a solution:

std::flat_multiset<int> set;
while(cpu_temperature() < 100)
    set.container.push_back(cpu_temperature());
std::sort(set.begin(), set.end());

This technique is easily adapted to whatever access patterns users have.

Case 2: Access to member functions

Functions such as reserve and max_size are not part of flat container interfaces. This is not a problem with a public container data member.

map.container.reserve(1000);

Flat containers are container adaptors and so any type of vector-like container can be used. Access to container allows users of custom vector implementations to use whatever member functions they need.

std::flat_map<int, int, my_vector<std::pair<int, int>>> map;
map.container.set_node_size(10);
map.container.debug_diagnostics = true;
my_library::hook_container(&map.container);

Occasionally it may be desirable to iterate a container using indexes rather than iterators. While it is possible to do this using the operator[] of iterators, using container for this is far easier to grasp.

for(int i = 0; i < map.size(); ++i) std::cout << i << ':' << map.container[i].second << std::endl;

Case 3: Copying and moving

The container member can be used to copy and move directly.

std::flat_set<int> set;
std::vector<int> vec = std::move(set.container);
set.clear();

set.container = std::move(vec);
vec.clear();

This is far cleaner than adding overloads to operator=.

Are there any downsides to `container`?

Nope!

Unsorted states exist in the design for iterators, not container. Removing container would not remove unsorted states. The container member is the one and only member of flat container classes. There are no variable out-of-sync issues to worry about.

[More programming articles] [Back to my homepage]