
> wage slaves

In the knowledge worker space, the wages are pretty nice. It's a stretch to call them "wage slaves".


If you can't quit your job because you have to pay your mortgage, then you are.

There is no middle class. "Wage slave" is the correct description.

Devs are expensive. Of course management wants to measure what they produce. It's a hard problem. There are no magic solutions yet.

GP isn't saying that there is evidence that open offices work. GP is saying that execs want such evidence. Way back when Google was young its execs thought outside the box, so it's no surprise that they didn't copy what MSFT was doing.

Wait, so now there are thinking-outside-the-box-execs who don’t need any evidence, and regular-gimme-evidence-execs who do?

Yes, but only in young startups. Once a company's earnings go beyond a certain point, it gets MBAs for executives.

If we hadn't effectively made shoplifting not-a-crime, we wouldn't have that worry right now.

Blame jurisdictions that made shoplifting up to $900 or similarly large amounts practically not-a-crime.

sarcasm detector broken

Is a "semantic layer" nothing more than a fancy name for a SQL VIEW in a NoSQL?

No, it's more than that.

A semantic layer is about decomposing views into dimensions and aggregates, then letting downstream apps/users compose their own views on top without having to redefine/re-calculate business-level metrics.

This makes data analysis more flexible than SQL views, which are hard-coded to particular groupings.


It's a lot more. A SQL VIEW is just a saved query, whereas a semantic layer defines the shared meaning of the data and helps enforce consistent metrics, joins, and logic across tools. You'd be surprised at how many ways "active customer" can be represented in SQL.

Doesn't a view do that?

  create view active_cx as select * from customer join audit_events using(...) join ... where -- active condition

  -- use active_cx wherever

  select ... from orders join active_cx using(...) where ts > start_of_month() group by active_cx.id

It sounds like "semantic layer" == views/queries created automatically and on the fly.

Kind of annoying the article writes "What is [a semantic layer] anyway?" twice but never defines it directly.

OP here - I wrote extensively about that elsewhere, which is why I linked to an existing article rather than explaining it once more, and focused here on the why and the how of building one. See also comment above: https://news.ycombinator.com/reply?id=44960004&goto=item%3Fi...

I looked for such a link in TFA, and it wasn't obvious.

The original Unix in-kernel wait queues were also like that.

Ideally the value should be two words -- 16 bytes on 64-bit systems.

Emm, what? Why? If you mean two processor words, which I gather from what you are saying, then I think you are already in the space of full memory barriers.

In that case, why not just say, ideally it would be 256K words, or whatever?


Because mainstream modern architectures (practically speaking, x86-64-v2+ and ARMv8+) give you[1] a two-word compare-and-swap or LL/SC.

[1] https://ibraheem.ca/posts/128-bit-atomics/


However using compare-and-swap as the atomic operation for implementing multiple events can be very inefficient, because it introduces waiting loops where the threads can waste a lot of time when there is high contention.

The signaling of multiple events is implemented efficiently with atomic bit set and bit clear operations, which have been supported since the Intel 80386 in 1985, so they are available in all Intel/AMD CPUs. They are also available in 64-bit Arm CPUs starting with Armv8.2-A, since Cortex-A55 & Cortex-A75 in 2017-2018 (in theory the atomic bit operations were added in Armv8.1-A, but there were no consumer CPUs with that ISA).

With atomic bit operations, each thread signals its event independently of the others and there are no waiting loops. The monitoring thread can very quickly determine the event with the highest priority using LZCNT (count leading zeros) or equivalent instructions, which are available on all modern Arm-based or Intel/AMD CPUs.

When a futex is used to implement waiting for multiple events, despite not having proper support in the kernel, the thread that sets its bit to signal an event must also execute a FUTEX_WAKE, so that the monitoring thread will examine the futex value. Because the atomic bit operations are fetch-and-OP operations, like fetch-and-add or atomic exchange, the thread that signals an event can determine whether the previous event signaled by it has been handled or not by the monitoring thread, so it can act accordingly.

So currently on Linux you are limited to waiting for up to 32 events. The number of events can be extended by using a multi-level bitmap, but then the overhead increases significantly. Using a 64-bit futex value would have been much better.

In theory compare-and-swap or the equivalent instruction pair load-exclusive/store-conditional are more universal, but in practice they should be avoided whenever high contention is expected. The high performance algorithms for accessing shared resources are all based on using only fetch-and-add, atomic exchange, atomic bit operations and load-acquire/store-release instructions.

This fact has forced the Arm company to correct their mistake from the first version of the 64-bit ARM ISA, where there were no atomic read-modify-write operations, so they have added all such operations in the first revision of the ISA, i.e. Armv8.1-A.


> In theory compare-and-swap or the equivalent instruction pair load-exclusive/store-conditional are more universal, but in practice they should be avoided whenever high contention is expected. The high performance algorithms for accessing shared resources are all based on using only fetch-and-add, atomic exchange, atomic bit operations and load-acquire/store-release instructions.

> This fact has forced ... there were no atomic read-modify-write operations, so they have added all such operations in the first revision of the ISA, i.e. Armv8.1-A.

I'm not sure if you meant for these two paragraphs to be related, but asking to make sure:

  - Isn't compare-and-swap (CMPXCHG on x86) also read-modify-write, which in the first quoted paragraph you mention is slow?
  - I think I've benchmarked LOCK CMPXCHG vs LOCK OR before, with various configurations of reading/writing threads. I was almost sure it was going to be an optimization, and the difference ended up being unobservable. IIRC, some StackOverflow posts led me to the notion that LOCK OR still needs to acquire ownership of the target address in memory (RMW). Do you have any more insights? Cases where LOCK OR is better? Or should I have used a different instruction to set a single bit atomically?

In terms of the relative cycle cost for instructions, the answer definitely has changed a lot over time.

As CAS has become more and more important as the world has scaled out, hardware companies have been more willing to favor "performance" in the cost/performance tradeoff. Meaning, it shouldn't surprise you if an uncontended CAS is as fast as a fetch-and-OR, even though the latter is obviously a much simpler operation logically.

But hardware platforms are a very diverse place.

Generally, if you can design your algorithm with a load-and-store, there's a very good chance you're going to deal with contention much better than w/ CAS. But, if the best you can do is use load-and-store but then have a retry loop if the value isn't right, that probably isn't going to be better.

For instance, I have an in-memory debugging "ring buffer" that keeps an "epoch"; threads logging to the ring buffer fetch-and-add themselves an epoch, then mod by the buffer size to find their slot.

Typically, the best performance will happen when I'm keeping one ring buffer per thread-- not too surprising, as there's no contention (but impact of page faults can potentially slow this down).

If the ring buffer is big enough that there's never a collision where a slow writer is still writing when the next cycle through the ring buffer happens, then the only contention is around the counter, and everything still tends to be pretty good; but the work the hardware has to do to sync the value will 100% slow it down, despite the fact that there are no retries. If you don't use a big buffer, you have to do something different to get a true ring buffer, or you can lock each record and send the fast writer back to get a new index if it sees a lock. The contention still has the effect of slowing things down either way.

The worst performance will come with the CAS operation though, because when lots of threads are debugging lots of things, there will be lots of retries.


One thing to add here, I've enjoyed reasonably extensive support for `atomic_compare_exchange_strong()` and the `_explicit` variant for quite a long time (despite the need for the cache line lock on x86).

But, last I checked (the last release, early last year) musl still does not provide a 128-bit version, which is disappointing, and hopefully the AVX-related semantics changes will encourage them to add it? :)


Because there are apps that use two-word pass/return-by-value values internally, so it'd be convenient.

I think part of it is that you shouldn't be using recursive locks, so why bother specifying support for them? IMO.
