Efficiency with Algorithms, Performance with Data Structures – Chandler Carruth

"Software is getting slower more rapidly than hardware becomes faster. -Niklaus Wirth: A Plea for Lean Software"

  • Niklaus Wirth – Algorithms + Data Structures = Programs

There are two sides to the performance coin:

Efficiency through Algorithms – How much work is required by a task.

  • Improving efficiency involves doing less work.
  • An efficient program is one that does the minimum (that we’re aware of) amount of work to accomplish a given task.

Performance through Data Structures – How quickly a program does it work.

  • Improving performance involves doing work faster
  • But there is no such thing as a "performant", not even work in the English language.
  • There is essentially no point at which a program cannot do work any faster… until you hit Bremermann’s limit…
  • What does it mean to improve the performance of software?
    • The software is going to run on a specific, real machine
    • There is some theoretical limit on how quickly it can do work
  • Try to light up all the transistors, get the most work done as fast as possible.

All Comes back to Watts

  • Every circuit not used on a processor is wasting power
  • Don’t reduce this to the absurd — it clearly doesn’t make sense to use more parts of the CPU without improving performance!

Algorithms

  • Complexity theory and analysis
  • Common across higher-level languages, etc.
  • Very well understood by most (I hope)
  • Improving algorithmic efficiency requires finding a different way of solving the problem.
  • Algorithmic complexity is a common mathematical concept that spans languages and processors. Don’t get lazy, you have to pay attention to your algorithms.
  • Example: – "it’s about doing less work", analyze the current approach and find ways to do it more efficiency
    • Initially, you might have a basic O(n^2) algorithm
    • Next, we have a Knuth-Morris-Pratt (a table to skip)
    • Finally, we have a Boyer-Moore (use the end of the needle)

Do Less Work by Not Wasting Effort, Example 1

std::vector<X> f(int n) {
 std::vector<X> result;
	for(int i = 0; i < n; ++i)
		result.push_back(X(...));
    return result;	
}

Initialize the collection size

std::vector<X> f(int n) {
 std::vector<X> result;
 result.reserve(n);
	for(int i = 0; i < n; ++i)
		result.push_back(X(...));
    return result;	
}

Do Less Work by Not Wasting Effort, Example 2

X *getX(std::string key,
        std::unordered_map<std::string, std::unique_ptr<X>> &cache) {   
    
    if(cache[key])
       return cache[key].get();
    
    cache[key] = std::make_unique<X>(...);
    return cache[key].get();    
}

Retain the reference to the cache entry

X *getX(std::string key,
        std::unordered_map<std::string, std::unique_ptr<X>> &cache) {   

	std::unique_ptr<X> &entry = cache[key];

    if(entry)
       return entry.get();

    entry = std::make_unique<X>(...);
    return entry.get();
}

Always do less work!

Design API’s to Help

Performance and Data Structures = "Discontiguous Data Structures are the root of all (performance) evil"

  • Just say "no" to linked lists

CPUs Have Hierarchical Cache System

Data Structures and Algorithms

  • They’re tightly coupled, see Wirth’s books
    • You have to keep both factors in mind to balance between them
  • Algorithms can also influence the data access pattern regardless of the data structure used.
  • Worse is better: bubble sort and cuckoo hashing

Writing Allocation Free Code in C# By Matt Ellis

Twitter: @citizenmatt
GitHub: https://github.com/citizenmatt/RefSemantics

Always be measuring – don’t do it because you think it’s the right thing to do. It might not be.

Managed memory is cheap. That’s not true, allocation is cheap and garbage collection is expensive. Garbage collection stops the world while things are cleaned up and optimized. Sometimes memory cleanup can impact the performance of the application.

Performance Low Hanging Fruit

Object reuse

Object pooling. Pass in existing array rather than allocating a new one. Preallocate array length.

String concatenation.

Use StringBuilder, Preallocate length if possible.

params arguments

void MyParamsMethod(params string[] args)
{

//... 

}

In Code:
MyParamsMethod("Hello", "World");
MyParamsMethod();

Complier:
MyParamsMethod(new {"Hello", "World"});
MyParamsMethod(new [0]); // this creates a new array object in memory

Instead, do this in code:
MyParamsMethod(new {"Hello", "World"});
MyParamsMethod(Array.Empty); // reuses an existing empty string array in memory. It's the same idea behind string.Empty

Suggestion: Introduce overloads with common number of arguments.

Boxing

Boxing creates a new object on the heap. Now you have two memory locations with the same value. Changing either value does not impact the other value.

Suggestion: Introduce generic overloads

Closures

The complier converts closures to classes. Captured values are passed in as constructor parameters. This class is then allocated to the heap.

Suggestion: Avoid critical paths. Pass state as argument to lambda. Investigate local functions. Since local functions lifetime is known, it can be allocated on the stack where allocation and cleanup are cheap.

LINQ

Lambda expressions are treated the same as Closures. They are allocated as a class to the heap. Because much of LINQ is based on static methods, additional allocations of Iterators and IEnumerable happen.

Suggestion: Avoid critical paths. Use good old foreach and if statements

Iterators

iterators are rewritten as a state machine, which means more allocations to the heap.

Suggestion: Return a collection. Be aware of the cost.

async/await

Async/await also, generate a state machine. Task and Task also trigger more allocations which then can’t be reused.

Look into ValueTask for common uses cases.

Suggestion: Investigate ValueTask.

Heap Vs Stack

Each method pushes space onto the stack for local variables. When the method is exited the memory is popped from the stack.

Stack allocation and cleanup are cheap because the memory has a lifetime. However, stack space is limited.

*Why passing value types copies the data versus passing a references it’s important to note that copying data isn’t that expensive.

Reference Semantics with Value Types

Allows value types to be used like reference types
– pass by reference everywhere

User value types to reduce allocations, reduce memory traffic, etc
– Throughput!

Pass by reference to avoid copies, enables modifying, etc.

Very low-level micro-optimizations…
– But they’ll be used in the platform…
– (And games, and parsing, and serialization, and …)

*These are low level optimizations that should only be used in paths were performance matters.

C# 7.2 Reference Semantics with Value Types

Allocating a reference type has a cost, but passing it around is cheap.
Allocating a value type is cheap, but passing it around has a cost.

Why can’t it be cheap to allocate AND cheap to pass around?

in parameters

Pass value type by reference. Called methods cannot modify it.

Method argument modifier

Complements out and ref

Passed by reference

Method cannot modify original value

Compiler enforces safety with defensive copy when calling members

ref locals and ref returns (C# 7.0)

ref returns

Returns a reference to value the, not a copy of the value.

– return type of method becomes e.g. integer reference int& in IL

Lifetime of returned value must exceed the lifetime of the called method

  • e.g. a reference to a field or method argument. NOT a variable in the called method.
  • Not allowed on async methods.

Modifying this reference is the same as modifying the original value

  • e.g. return reference to array element, and update it in place.

Add ref modifier to method declaration return type, and to return statement

ref locals

Assign a ref return to a new variable will create a copy

  • The variable is a value type, not a reference. (Cannot assign int& to int)
  • A ref local is a variable that is a reference to a value type
  • Accessing the variable accesses the original value

Use a ref local to store the ref return result

Type inference with var will get the value type, not the ref modifier

  • Requires ref var to work as expected.

ref readonly returns

Returns a read only value type by reference

readonly struct

Immutable value types

in parameters and ref readonly can create defensive copies

  • The compiler doesn’t know if the struct’s methods will modify state

readonly struct – compiler enforces all fields and properties are readonly

Immutable

More efficient – no copies made when calling members

  • Improves performance (micro-optimization)

ref struct

Stack only value types

Declare a value type that can only be stack allocated

  • I.e can never be part of a reference type

This constrains lifetime to calling method

  • Also, cannot be boxed, cannot use inside a non-ref struct
  • Cannot use with async methods or iterators
  • Cannot be a generic parameters

Limited use cases

  • Working with stackalloc memory
  • Primarily for Span<T>

What does SPAN have to do with ref struct?

For thread safety, need to update all fields of Span atomically (tearing)

  • Whole point is performances – cannot use synchronization

Internal pointers require special GC tracking

  • Too many in flight at once is expensive

How can SPAN represent stackalloc memory is SPAN was on the heap?

Solution: Span<T> is a ref struct – can only be created on the stack

  • Constrained lifetime, single thread access

Span<T>

New type of unify working with any kind of contiguous memory

  • Arrays, array segments, strings and substrings, native memory, stackalloc, etc

Provides array-like API – indexer

  • ReadOnlySpan provides getter indexer only

Type safe – each elements is of type T

Array-like performance

  • Not quite, but newer runtimes have special support

Slicing

  • Create a new Span with a sub-section of existing – without allocations!

Span<T> Implementation

Value Type – struct

System.Memory NuGet package

  • .NET Standard 1.1 (.NET Framework 4.5)+

New APIs and overloads in the BCL

  • E.g. String.AsSpan(), Stream.ReadAsync(), Utf8Parser.TryParse()
  • Significant usage of ref semantics – allocation free!

Span, ReadOnlySpan, Memory

Two versions – “portable” and “fast”

  • fast requires runtime support

Span<T> Performance – Portable Implementation

Portable works on .NET Standard 1.1 and above

  • .Net Framework 4.5+

Portable is not slow

  • But not as fast as arrays

Three fields – object reference, internal offset and length

  • Slightly larger than fast version, dereferencing is slightly more complex operation.

Span<T> Performance – Fast Implementation

Fast requires runtime support

  • .Net Core 2.1

Only has two fields – “byref” internal pointer and length

  • Slightly smaller struct and accessing an element is slightly simpler operation

Specific JIT optimizations

  • e.g elding bounds check in loop, like arrays

Very close to array performance