Buffet – An All-inclusive Buffer for C (github.com/alcover)
206 points by Nakili on March 8, 2022 | hide | past | favorite | 76 comments


A very well thought out set of functionality. It covers a lot of ground. In particular:

• The reference counting of the bft_view can simplify reasoning in a C program.

• The cheap bft_view may keep programmers from copying as much data. It's easy to fall into a "copy everything" mode in C since that's what the runtime library provides and consumes. Working with slices reduces your cache pressure and is just generally faster.

• Allocating only blocks sized by a power of two potentially wastes memory. The exponent for the size has a range of 0...255, but it only needs to reach 2^32, since that's the largest allowed length. I suspect there is some clever base between 1 and 2 for which the exponent and its inverse are trivial to compute in integer arithmetic. (Or a similarly shaped function; it doesn't have to be an exponential, just something that grows about like one.) This would provide better granularity.

• Alternatively, some bits could be harvested from 'cap' to indicate how the memory should be freed: usually it's 'free', but one could imagine adding a few lines to support 'mmap', in which case the release becomes 'munmap'.

• Now that we have imagined up mmap capability, this looks like a very nice way to handle many kinds of data files which are not character oriented streams processed sequentially.


(author)

> A very well thought out set of functionality

Thanks! It's only an experiment at this stage but I was pleasantly surprised it worked in such a small size.

> slices (...) generally faster

It works wonders. The upcoming bft_split function is gonna be dead fast.

I optimize things by adopting the laziest possible persona: do I really need to copy this? No. What if the user wants to append stuff to a split part? We'll see then.

> Allocating only blocks sized by a power of two is potentially memory wasting

It just follows the classical doubling of the requested size. At high exponents, though, I admit it'll be wasteful.

> some clever base between 1 and 2

I wouldn't have thought of it, nor known how to. Would it be somehow costly? Shifting is basically free.

> support 'mmap'

Interesting. But shouldn't we keep the lib surface minimal and just leave this to the user? She maps, then takes a view on it?


Hey, an opportunity to shill my favorite ISO standard!

It's 3. ISO 3, preferred numbers: https://en.wikipedia.org/wiki/Preferred_number#Computer_engi...

For your application, I would apply a heuristic which includes the x3 values at some point (256KiB?), and maybe the x5 values near/at 256MiB.

Allocation is so expensive relative to computing the new size that you could use something really complex and it would still work. Certainly multiplying by a factor between one and two and rounding to the nearest word would work; there's no way you'd see a performance difference from the calculation.

The advantage of ISO 3 is that the user always gets a familiar number of allocated bytes back.
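To make the idea concrete, here is a sketch of rounding a request up to the nearest size in the 1x/1.5x series (powers of two plus the "x3" midpoints, i.e. ... 16, 24, 32, 48, 64, 96, 128 ...). The function name and the 16-byte floor are made up for illustration; they are not part of Buffet.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: round n up to the nearest size in the series
   ... 16, 24, 32, 48, 64, 96, 128 ... i.e. powers of two plus
   the "x3" midpoints (3 << k). Not part of Buffet. */
static uint64_t round_pref(uint64_t n)
{
    if (n <= 16) return 16;
    uint64_t p = 16;                 /* power of two at or below n */
    while ((p << 1) <= n) p <<= 1;
    if (p == n) return n;            /* already a power of two     */
    uint64_t mid = p + (p >> 1);     /* the 3*2^(k-1) midpoint     */
    return (n <= mid) ? mid : (p << 1);
}
```

The effective growth factor alternates between 1.5x and 1.33x, and the user always gets one of the familiar preferred sizes back.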


> > some clever base between 1 and 2

> I wouldn't have thought of it, nor known how to. Would it be somehow costly? Shifting is basically free.

Multiplying by 1.5 is a shift and an add, so just as cheap.

Multiplying by 1.5 has the advantage that 1 + 1.5 > 1.5 * 1.5, so the second time you realloc it fits into the old memory (which can help reduce memory fragmentation if the allocator takes advantage of it).
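For illustration, the shift-and-add growth and the refit property sketched in C (helper name is hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* 1.5x growth: one shift and one add, no multiply needed.
   Hypothetical helper, for illustration only. */
static size_t grow_1_5(size_t n)
{
    return n + (n >> 1);
}
```

Starting from 16 bytes the sequence is 16, 24, 36, ..., and indeed 36 <= 16 + 24, so the second realloc fits in the two previously used blocks, exactly the 1 + 1.5 > 1.5 * 1.5 argument.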


> Multiplying by 1.5 has the advantage that 1 + 1.5 > 1.5 * 1.5, so the second time you realloc it fits into the old memory (which can help reduce memory fragmentation if the allocator takes advantage of it).

I’m really struggling to think of any mainstream allocator that can take advantage of this.


The clever base between 1 and 2 you are looking for is the golden ratio. I would use it only for bigger numbers, though; under 5K, 2 is good enough.


Power-of-two reallocs are usually there to prevent append-in-a-loop from becoming accidentally quadratic, so I guess it's either a bug or a feature depending on whether you're going to do that.


As long as there is a non-zero multiplier, the size of the constant factor does not matter, and it does not even need to be constant. I recently spent a while looking at the shape of this non-typical allocation growth curve, and ended up going with `n₁ = n₀ + 4n₀^(7/8) + n₀/8` using only integer math:

    // compute maxsize = maxsize + 4*maxsize^(7/8) + maxsize/8
    // for small n, we grow faster than O(n)
    // for large n, we grow at O(n/8)
    // and as we reach O(memory) for memory>>1MB,
    // this means we end by adding about 10% of memory each time
It was graphed, discussed, and approved here:

https://github.com/JuliaLang/julia/pull/40453/files#diff-fb6...
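A rough C sketch of that curve. Here n^(7/8) is approximated as 2^(7·floor(log2 n)/8) via the bit length, which is an assumption on my part for illustration; see the linked PR for the real implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of n1 = n0 + 4*n0^(7/8) + n0/8 in integer math.
   n^(7/8) is approximated as 2^(7*floor(log2 n)/8) using the bit
   length -- an assumption for illustration; the Julia PR linked
   above is the authoritative version. */
static uint64_t grow78(uint64_t n)
{
    int bits = 0;
    for (uint64_t t = n; t > 1; t >>= 1) bits++;     /* floor(log2 n) */
    uint64_t pow78 = (uint64_t)1 << (bits * 7 / 8);  /* ~ n^(7/8)     */
    return n + 4 * pow78 + n / 8;
}
```

For small n the 4·n^(7/8) term dominates (fast growth); for large n the n/8 term takes over, so each step adds roughly 12% rather than doubling.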


Python is similar[1]

>The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...

https://github.com/python/cpython/blob/3.7/Objects/listobjec...
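The formula behind that pattern, from CPython 3.7's Objects/listobject.c, transcribed to C (the newsize == 0 special case is from the same file):

```c
#include <assert.h>
#include <stddef.h>

/* CPython 3.7's list over-allocation (Objects/listobject.c):
   new_allocated = newsize + (newsize >> 3) + (newsize < 9 ? 3 : 6)
   This produces the 0, 4, 8, 16, 25, 35, ... pattern when a list
   grows one element at a time. */
static size_t list_grow(size_t newsize)
{
    if (newsize == 0)
        return 0;
    return newsize + (newsize >> 3) + (newsize < 9 ? 3 : 6);
}
```

So the asymptotic factor is about 1.125x plus a small constant, much gentler than doubling.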


Any multiplier > 1 is enough to keep a "grow by constant each iteration" loop from becoming quadratic.


> I suspect there is some clever base between 1 and 2 which is trivial to compute

Something like this perhaps (although this might be too much granularity):

  (8 + (cap & 7)) << ((cap >> 3) + 1)

  16 18 20 ... 30
  32 36 40 ... 60
  64 72 80 ... 120
  128 ... etc.
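That mapping can be checked directly: the low 3 bits of the 8-bit cap code pick a mantissa in 8..15 and the remaining bits pick the shift.

```c
#include <assert.h>
#include <stdint.h>

/* The mapping above: low 3 bits of cap select a mantissa in 8..15,
   the high 5 bits select the shift. */
static uint64_t cap_to_bytes(uint8_t cap)
{
    return (uint64_t)(8 + (cap & 7)) << ((cap >> 3) + 1);
}
```

The maximum, cap = 255, gives 15 << 32 bytes, i.e. about 60 GB, matching the figure mentioned in a reply downthread.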


I remember seeing a paper that concluded the expansion factor should not be more than the golden ratio (ϕ = 1.618033988749). Given how widespread the golden ratio is, I'd hypothesise that it's probably optimal.

Fibonacci-ratio approximations of ϕ include 3/2, 8/5 and 21/13. The first two are quick and easy to compute.


If the multiplier is less than the golden ratio, then the new allocation fits into the memory released from the previous two allocations.
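A quick numeric check of the claim, using an illustrative helper (not from the library):

```c
#include <assert.h>
#include <stdint.h>

/* Growing by num/den each step, does every new allocation fit in
   the memory of the previous two? Illustrative sketch only. */
static int always_fits(uint64_t a, uint64_t b,
                       unsigned num, unsigned den, int steps)
{
    for (int i = 0; i < steps; i++) {
        uint64_t next = b * num / den;
        if (next > a + b) return 0;
        a = b;
        b = next;
    }
    return 1;
}
```

With factor 3/2 (< ϕ) the next block always fits in the previous two, since 1.5² = 2.25 < 1 + 1.5; with factor 2 it never does, since 2² = 4 > 1 + 2.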


But most allocators will not actually take advantage of that, so it’s a pretty dubious argument.


Good point, I haven't actually checked if any realloc implementations do that.


I thought, oh, you only need two bits for the multiplier, that gives you four multiples of each tier size, that should be enough.

But that means the largest buffer (with a slightly less sophisticated formula than yours) is ((2 << 2) * 2 << (2 <<(8 - 2) - 1)) = 32 exabytes.

Meanwhile, yours has a maximum of 60 GB. So I don't think you have too much granularity!


(author)

I just woke up minutes ago and see this... I intended to post this project on HN when it would be mature, not mere hours old... But here we go.

I'll try to answer questions though I'm out of coffee.

Related: to those willing to swap type-safety for ease of use, I also did Stricks, a faster alternative to SDS. https://github.com/alcover/stricks


Thank you for working on libraries like this! I expect the community will be supportive, even if the code hasn't been baking very long. That's how great libraries prove themselves, through usage & feedback.


Here's a C vector, implemented as a macro, which we use pretty extensively and which is along similar lines:

https://gitlab.com/nbdkit/nbdkit/-/blob/master/common/utils/...

On top of this we implement a string type:

https://gitlab.com/nbdkit/nbdkit/-/blob/master/common/utils/...

Example usage of the vector to store NULL-terminated argv-type arrays of char*:

https://gitlab.com/nbdkit/nbdkit/-/blob/master/common/utils/...


I'm going to be skeptical about this one. Shoving refcounting + SBO + non-owning views into a single type is not a smart decision.

Before going into why I think so, some general implementation things to look into:

  * Power of 2 exponent is not optimal. You will never be able to reallocate into the same stretch of memory that the arrays have previously occupied, as the sum of their sizes will always be smaller than the next one. C++ stdlibs have switched to values closer to 1.5.
  * Extra care should be taken when using memcpy. It's possible to copy overlapping buffers, which is not allowed; using memmove instead is a simple solution. Also note the case when appending the Buffet structure itself to its array.
  * You can't yet count on all machines having a ctlz/lzcnt instruction. There have lately been quite a few cases of this instruction crashing software because no fallback was provided.
  * If someone asks for bft_new(bft, 0x1000) you'll end up allocating 0x2000 bytes even though an exact size was specified. IMO scaling should be done only on appends.
  * If you force the structure alignment to 16 you might as well say that your type is 24 bytes in size, because you'll very likely be adding tons of padding to structures that include your buffer.
  * Seems like quite a few checks for malloc failure are missing + proper reporting for that.
Now why I think this is a terrible idea:

  * The model of your reference counting is a minefield and will 100% lead to problems when used in real world.
    - Free'ing OWN before all refs = leak (although this could be fixed by giving OWN refcount of 1 instead of 0)
    - Reference count is non-atomic = potential corruption
    - bft_view() has differing behaviors of output depending on what state it's in = hard to reproduce bugs.
  * You won't be fast just because you fit into 16 bytes.
    - Your code has branches on every single hot path that is branch-free in other implementations. There isn't really anything you can optimize, as all the branches have essentially equal weight, so you'll be getting mispredictions by design.
    - You have multiple levels of indirection for data pointers so you'll be stalling the pipeline on cache misses and dependent loads as well as bad speculation.
  * SBO - OK, refcount - OK, but why add a non-owning view into the equation? What's wrong with good old pointer+size, or pointer-start + pointer-end, for that? There are already enough footguns to worry about with 3 states.
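The plain pointer+size view suggested above is tiny. A minimal sketch (names hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The "good old pointer + size" view: a non-owning slice with no
   refcount and no lifetime tracking. Names are hypothetical. */
typedef struct {
    const char *ptr;
    size_t      len;
} slice;

static slice slice_of(const char *s, size_t off, size_t len)
{
    return (slice){ s + off, len };
}
```

The trade-off is explicit: the user is responsible for keeping the owner alive while any slice of it exists, which is exactly the discipline the refcounted VUE tries (and arguably fails) to automate.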


(author)

Thank you a lot! I'm humbled by the attention you gave this. Please bear in mind I never intended to see this project publicized so early on the HN stage. I'm still experimenting in C, and it shows...

That's a lot to chew on. For the moment:

> Free'ing OWN before all refs = leak

More like UAF? Also, you can't. That was the idea. But it seems a bad one, according to you and user rom1v.

> Reference count is non-atomic = potential corruption

I'll try to make it so. This aspect is new to me. Anyway, I hear it's quite hard, beyond that, to have a thread-safe lib, so I'll advertise it as non-TS.

> branches

Maybe that's the price for compacting str + str_view into a single type?

> enough footguns to worry about with 3 states.

I agree the VUE should maybe drop.


> More like UAF ? Also, you can't. That was the idea. But it seems a bad one, according to you and user rom1v.

Well, in your code, if you bft_free OWN with outstanding REFs, nothing happens, and the bft_free on the last REF does nothing either, so it becomes just a leak.

> I'll try to make it so. This aspect is new to me. Anyway I hear it's quite hard, beyond that, to have a thread-safe lib, so I'll advertize it as non-TS.

IMO thread-safe should be the default with an easy opt-out by defining a single macro or something like that.

> Maybe that's the price for compacting str+str_view into a single type ?

Kind of. Without the VUE functionality I think you could refactor it into 2 states - SBO or REF.

  struct Buffet {
    union {
      struct {          // | u32 refcnt | char data[] |
        void*     ptr;  // --------------^ points to data. at offset -4 is refcount / start of alloc
        SIZE_TYPE size;
      } large;                                        // cap > sizeof(void*) + sizeof(SIZE_TYPE)
      char small[sizeof(void*) + sizeof(SIZE_TYPE)];  // cap <= sizeof(void*) + sizeof(SIZE_TYPE)
    } storage;
    SIZE_TYPE cap; // on big endian this value could also function as the null terminating byte
  };


> if you bft_free OWN with outstanding REFs, nothing happens and the bft_free on last REF does nothing as well so it becomes just a leak.

I seem to have solved this with a `wantfree` flag. https://github.com/alcover/buffet/commit/ade05630aafdf5c940e...


Buffet is a versatile string buffer type featuring:

* automated storage and reallocation

* SSO (Small String Optimization): short data is stored inline

* views: no-cost references to slices of data

* ref count: accounting of the reference-owner relation

In only 16 bytes, fitting a 128-bit register.


Fantastic, would love to use it, but, it needs a license.


Same here, I need something like this for my Lisp interpreter. I was thinking about Simple Dynamic Strings but I like this better.

https://github.com/willcipriano/Connery


(author here)

> I was thinking about Simple Dynamic Strings but I like this better.

Oh! here's my chance to plug https://github.com/alcover/stricks ! It follows the same principle as SDS (i.e user-facing type is char*) but is much faster (see bench) and frankly nicer to read through.

Not as feature-complete, though.


I used https://github.com/kumkee/utf8 (along with a bit of hacking) in my toy scheme interpreter to get Unicode strings.

Had to adapt some script I found to generate character-class lookup functions to make it fully work, IIRC, and I still need to plug in some sort of copy-on-write mechanism since the strings are managed in a shared buffer, but other than that it works pretty well... if you're into the unicorn strings, that is.


have you looked into sds: https://github.com/antirez/sds/


Have you read the parent comment?


so used to just calling it sds. it took me a second. my bad, homes


(author here)

Thank you. The license will come, but I have a harder time understanding licenses than pointers, haha.


For a utility library like this BSD 2 clause or MIT (they're essentially the same) would probably be the most common choice.


If you want others to make money from your code: MIT, BSD

If you want to use it as an offering to the open source community: GPL


When it's mature, would you object to dual GPL + commercial?


32 bits for len & offset doesn't play nicely with mmapped files.


(author)

My goal here is to:

- merge what C++ std::string and std::string_view do into a single C type.

- make it foremost fast. In 16 bytes, it fits snugly into an xmm register.

- use this one day in a new language runtime


(following a discussion from https://github.com/alcover/buffet/commit/eab9648ff19483f57fd...)

> if nobody's looking, we are go

https://github.com/alcover/buffet/blob/dee3eb65f37ca07ea3d27...

And otherwise, when is it freed?

The refcount mechanism looks wrong to me: either the ownership of the internal data is shared between instances (it seems it is not), or the owner is well defined and does not need a refcount.

It appears that the refcount is used to avoid freeing the owner data if Buffet views are still alive. But using the views after freeing the referenced data is incorrect usage anyway, so it looks like defensive programming, trading a use-after-free for a leak.


License?


Can you elaborate on the tagging scheme? I'm not sure if the tag is stored in the high bits or low bits of the data pointer. If it's the low bits, how does it interact with the VUE repr, whose pointers are char-aligned (that is, unaligned)?


It seems that the tag is in the low bits, and for VUE displaced low bits are stored in the offset field instead.


Definitely low bits:

#define DATA(buf) ((char*)((intptr_t)((buf)->data) << BFT_TYPE_BITS))

Older programmers tend to raise an eyebrow at added computation like the "<<", but here in 2022 we need to remember that we are about to do a memory access and even if it's in L1 cache, that "<<" is basically free by comparison.


Most microcontrollers do not have cache of this form (rather a simple read-ahead buffer at most). Microcontrollers are also one of the targets of this type of library and the shift absolutely will have an effect.


This is somewhat similar to how Swift represents Strings internally (https://github.com/apple/swift/blob/main/stdlib/public/core/...).

Swift needs to support a few other options (it tracks UTF-8 vs UTF-16 encodings, and needs to know if a String is bridged from an Objective-C object), and it doesn't cap lengths or offsets at 32b. Those details aside, this is a very familiar approach.


    typedef union Buffet {
      
    // SSO
    struct {
        char        sso[15];  // in-situ data
        uint8_t     ssolen:4; // in-situ length
    }
I feel like this doesn't work in ARM GCC. Or rather, it works, but it's going to be 16 bytes right there. Do systems that are not aligned also default to packed? Does that matter here?


Yes, this will be 16 bytes on Arm, and everywhere else pretty much. That doesn't mean the whole outer struct can't also be 16 bytes (this struct is one member of a union).


Yea but here is my problem.

    uint8_t  type : 2;  // tag = {SSO|OWN|REF|VUE}
I had thought the tag was part of the SSO, but it's not. Well, not accessible, but I think that's what some of the unused bits end up being?


I also find that questionable. The compiler is allowed to pull a bit field from low or high bits (i.e. not specified in the standard) and it might be further affected by endianness.

This code assumes that on little-endian gcc pulls the 62 bits from the low bits of the word, and on big-endian from the high 62 bits. That might actually be what gcc does, but I don't think any of it is guaranteed.


(author here)

That's concerning... Should I rather use a full 64-bit data field + masking? I'm a bit naive about portability.


So a disclaimer, I have never read the C standard in full and I don't consider myself an expert, but this one about bit fields is one of many anecdotes I've accumulated over the years about portability. The rule is never use a bit field in a serialized file format because different compilers could interpret them differently. As long as a bit field never leaves the memory of the host that wrote it, they're fine. But in this case, you are expecting the bit field to overlay in a specific manner with a union, so I think you run into the same problem.

Making a library more portable in C usually means making the code an ugly mess of macros. So first decide how much time and code clarity you want to sacrifice vs. the systems you want to be able to run on. I have a large-scale personal data-encoding library that I never finished, in part because I tangled myself up trying to make it 100% defined-behavior C code. The world only has 3 major compilers: gcc, clang, and Visual Studio. If your code works on all 3, and it's a hobby project, that's probably good enough. There are also vanishingly few big-endian systems left, aside from some embedded ARM processors. Also, declaring :62 will be a compile error on a 32-bit system, so that's maybe your biggest portability concern? But again, 32-bit is either old or embedded, so maybe not worth worrying about.

I believe that to make this portable to big-endian or 32 bit, you would need to make a macro to access the pointer and flags. Then you need to detect the endianness in preprocessor directives and set the macro to either a shift or a mask.
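A sketch of that mask-based alternative: keep one plain uint64_t and do the shifts yourself, so the layout is defined by the code rather than by the compiler's bitfield rules. Names are hypothetical, not Buffet's actual API.

```c
#include <assert.h>
#include <stdint.h>

/* Mask-based alternative to the 62-bit + 2-bit bitfield pair: one
   plain uint64_t whose low 2 bits hold the tag. Works the same on
   any endianness or word size, since shifts and masks are fully
   defined. Hypothetical names. */
#define TAG_BITS 2
#define TAG_MASK (((uint64_t)1 << TAG_BITS) - 1)

static uint64_t pack(uint64_t ptrbits, unsigned tag)
{
    return (ptrbits << TAG_BITS) | ((uint64_t)tag & TAG_MASK);
}
static unsigned get_tag(uint64_t w) { return (unsigned)(w & TAG_MASK); }
static uint64_t get_ptr(uint64_t w) { return w >> TAG_BITS; }
```

This mirrors what the existing DATA() macro already does on read; the change would be to drop the bitfields and make the masking the single source of truth for the layout.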


That was very helpful. I'll skip the endianness and try to resolve the hardcoded :62 part.


(author)

Should I add __attr(packed) to be on the safe side ?


I probably would. Or at least a comment about packed. I can’t remember when __attr(packed) works and when it doesn’t.

You probably don’t want preprocessor hell (see xxHash if you don’t know what I mean), so maybe just determine it one way or the other and leave a comment somewhere.


I think I prefer my design, but I have the luxury of a GC. The SSO is 64-bit but has no length; strlen on 7 bytes can be made constant-time via SSE. It's just zero-terminated and needs some tag bits, with 00 for SSO, 01 for longer strings and 10 for views.


A small set of standardized data structures (vectors, maps) would do so much good for C.


That would be nice, then I wouldn't have to use non-standard stuff.

I made my own easy-to-incorporate-into-any-project library - https://github.com/lelanthran/libds - just copy the ds_*.h and ds_*.c into a project and you're good to go.

I'm not saying it will work for you, but it works for me.


So it's not standard or fully cross-platform, but I consider GLib to be the de facto standard data-structures library. Have you tried it out?


Cool!

Why not put the `type` field also in the embed struct if, per the diagram, it's also going to be checked there?


This is how it is implemented in Go.


Put this into the next C2x and we won't need Rust anymore.


The problem isn't Rust, or whatever else is trying to replace C; rather, it's the number of devs who keep ignoring C best practices for writing secure software from day one.

Just making use of static analysers (lint has existed since 1979), treating warnings as errors in the CI/CD pipeline, and turning on all compiler defenses in debug builds would make a huge difference.


The implementation isn't endian friendly :(


(author)

Is it because of the bitfields? Should I rather use a full 64-bit data field and mask it to get at my flags?


The first thing that struck me was distinguishing the `SSO' part of the union from the rest. It requires the `ssolen' field to align with the later `type' field, which isn't guaranteed.

There might also be issues with the `data' field if padding gets introduced.

Having the `type' field (as a uint8_t or char) first would solve this but loses some bits of storage. The other option would be making the type an array of char and bit bashing everything (and memcpy(2) multibyte fields since unaligned accesses aren't universal).


Are bit fields a good idea?


The layout of a bit field in C is implementation-defined. That makes bit fields a poor choice for interfacing with hardware registers/descriptors or interoperating with binary file formats (unless you are fine welding yourself to a specific set of compilers).

In applications like this where you really don't care about the layout of bits, they are just fine.

However… there is a violation in this library. As nrdvana points out in another comment, the "uint8_t ssolen:4" and "uint8_t type:2" fields are assumed to lie on top of each other, in the same bits of the same byte, but this isn't really guaranteed by the C standard, though I think any sane compiler will do it.


(author here)

I think I read that the order of struct members is guaranteed. So ssolen and type should not overlap?


[flagged]


UNIX/POSIX clones and embedded still demand C for most workloads, so that is going to be quite hard to ever change.

There is some hope for embedded, once enough of the current developers are gone. For UNIX/POSIX, however, it is impossible given the symbiotic nature of the two; it will only change when we get rid of UNIX clones.


Thanks. I was genuinely curious. I’m surprised embedded still overwhelmingly demands C, but I’m guessing there’s a lot of inertia which you alluded to.


Mostly due to culture, there are hardly any chips without C++ compilers, and plenty of them have Basic and Pascal compilers available as well.

This is something that can only be changed with new blood bringing a more open mindset.


C++ has all the problems that come with C plus all of its own problems. Also, OOP is pointless.


C++ != OOP. While C++ cannot fix the flaws inherited from C, it can at least provide types with bounds checking and data types with type-driven invariants, something C will never provide.


Embedded systems


Only obsolete versions of GCC are available for many old chips, so those are limited to C. Otherwise, it's better to use Rust in new projects. Rust can do almost everything C can do in fewer lines of code, sometimes even faster and with less memory. Moreover, it's possible to write firmware, a kernel, a network server, or a web or desktop application in Rust, so a single language can be used in a broad range of cases.


We are a successful hardware startup and our line of products is built around TI F28004x series MCUs. Even if we wanted to use Rust, none of our chips are in any of Rust's supported platform tiers. That's a nonstarter.


Looks cool, but this project is only 11 days old. If it were mature I might've used it in EQE, but for now I have my own bespoke solution.

I'm surprised how many lines of code this is; I guess the tagged-pointer stuff adds a lot of complexity.



