Python continues to introduce more non-intuitive semantics that may be a small b...

hjk05 · on March 6, 2019

In my teaching of python to newcomers (mostly coming from matlab/R or no programming background) they often try to do dict_a + dict_b, and are confused as to why that doesn’t work when list_a + list_b works fine.

It think it’s an extreme stretch to claim it’s non-intuitive.

speedplane · on March 9, 2019

If dict addition was purely insertion, I would agree with you, but there is no way the following is intuitive:

  {1 : 1} + {1 : 2} ==> {1 : 2}

meowface · on March 6, 2019

I couldn't disagree more. Python 2 was a mess. range vs. xrange, items vs. iteritems, keys vs. iterkeys, input vs. raw_input, strings vs. Unicode strings, integer vs. float division were a mess, and were especially confusing and inconsistent for beginners.

Teaching Python 2 to beginners was always annoying for them: "ok so there's this function called input() but NEVER use it, always use raw_input(), unless you like RCE", "although all the tutorials say `for i in range()`, you should really get in the habit of using xrange() because...". Generators don't need to be explained in detail or understood by a beginner; all that really needs to be taught is the concept of iterators, and eventually, at an intermediate stage, the idea that some iterators are lazily-evaluated.

A simple dict "copy + merge" addition operator is a perfectly reasonable idea that will help beginners, not hurt them.

speedplane · on March 9, 2019

> Python 2 was a mess. range vs. xrange, items vs. iteritems, keys vs. iterkeys

Generators execute asynchronously, you have to keep track of what they are up to and where they are in their iteration process. This can cause all sorts of problems. Consider the following:

  for x in pull_from_database():
    do_something_with_disk_or_network(x)

If pull_from_database returns a list, the code can be relatively easily understood. If it's a generator, this can be an incredibly confusing piece of code because do_something_with_disk_or_network can alter the generation of pull_from_database.

The same logic applies to iterating over dictionaries or other items. With python 3, I'm sure we'll start seeing many bugs of the following nature that can be pretty difficult to debug:

  d = get_dictionary()
  for k, v in d.items():
    do_something_and_possibly_mutate(d)

dbrgn · on March 6, 2019

To me, the + operator for merging lists seems very intuitive.

Areading314 · on March 6, 2019

The update function does not work well. It is very cumbersome to have to do an in-place update. A frequent bug I see is

  def my_func(d1, d2):
      """Returns a merged dict"""
      d1.update(d2)
      return d1

The problem here is that now the d1 you have passed in has been modified to contain all the keys of d2, overriding any keys that appear in both with d2's value. Having a first-class operation that does a merge without mutating the inputs will make the language easier, not harder.

speedplane · on March 9, 2019

I agree that the update function can be cumbersome, but the "+" operator implies semantics that do not apply well to dictionaries. For example, the following is illegal in python:

  {1, 2, 3} + {3, 4, 5}

Similarly, addition with dicts should not be allowed. The pipe operator would be a closer fit, but even that has problems because with sets it's commutative and with dicts it's not.

varelaz · on March 6, 2019

Yeah, I hate this. Now dict.items() become not thread safe just because of iterators. It could crash anytime just because you modified dict in another thread while iteration is in progress

quietbritishjim · on March 6, 2019

I'm pretty sure that the situation you're describing was not thread safe in Python 2 either.

Sure, once you're in the body of the for loop, the dictionary must have been copied to the list so you're safe. But while d.items() is being evaluated at the start of the for loop, there is an internal iteration that could be preempted by the other thread. The GIL doesn't save you because Python operations aren't guaranteed to be atomic, and I doubt something that complex would be (it would be a serious problem if iterating over a large dictionary in one thread held up all other threads for an arbitrarily long time). Even if it is GIL-atomic, you're risking breakage if you move to another implementation (e.g. pypy) or if Python changes its atomiticity in future.

In general, if you want to modify an object in one thread and read it in another thread, you should add locking to prevent this happening simultaneously.

It is however true that the Python 2 items() method allows you to modify the dictionary in the body of the same for loop. But this is a surprising exception compared to iterating over a list or other container, so it makes sense overall to demand you explicitly make a copy if that's what you want.

varelaz · on March 6, 2019

In python 2 items() returned list, and access to dictionary was blocked by GIL, so while array is prepared, dict couldn't be modified. So it is thread safe in Python 2. In python 3 you need to lock, but it's not always obvious until it bites you. You may think that you need threads only for parallel processing and it's easy and managed, but there are much more common cases when you may use threads – UI or third party toolkits like QT, which often run callbacks in their own threads. And there is no other way to protect items() except of locking, even if you would try to prepare array out of iterator to make it faster, any parallel thread could break it by modification.

For myself I found only one good solution. Subclass dictionary and create thread safe version of it with locks around all critical operations: modifications and reads. If you want to make it more efficient you need separate read and write locks.

quietbritishjim · on March 6, 2019

> while array is prepared, dict couldn't be modified

I mentioned this exact situtation in my comment. In fact that's what most of my comment is about.

To repeat:

* I don't believe it actually is atomic (but I haven't checked ... have you?)

* Even if it is it wouldn't be guaranteed to be atomic in future versions of Python (ignoring the fact that future versions of Python no longer have items() with the same symantics).

* It won't be safe in other implementations of Python e.g. pypy

* It doesn't match other collections that you can iterate over that don't need an items() e.g. list

* (This one is new) It won't be safe in user-defined dict-like classes that define their own items() method, even if that method is supposed to have the same symantics.

Modifying an object in one thread while reading it in another is a bug, even it seems to work for now. Don't blame Python for making it slightly more likely to break. Just using use a flipping mutex!

speedplane · on March 9, 2019

Lets assume that evaluating a Python 2 items() call or list isn't atomic, and it would break some multi-threaded code. Even with that, there is a huge difference with iterators that can be passed around left and right, and be executed far after they are generated.

Using a non-threadsafe list, race conditions and other problems will likely crop up in CPU-bound applications. However, with iterators, that may get lazy executed far after they are created, race conditions are far more likely to occur.

As an example, consider the following program:

  for value in x.items():
    do_shared_network_or_disk_call(value)

If "x" is a list, there is definitely the possibility of race conditions cropping up. But if "x" is an iterator, the possibility of that increases dramatically. In a multi-threaded/processed environment, both are bad, but why would Python 3 try to make the situation worse?

varelaz · on March 6, 2019

It's atomic in CPython and protected with GIL. It will be safe in user-defined dict classes if you will make it safe and care of this. And everything above is a matter of implementation. What you write is pure and correct in common sense, but it's not practical. If you have thread safe data structure that care about its state consistency itself, why not to use it without locks and make things simplier? I don't talk about syncing state of several data structures etc. I'm talking about very simple use cases when it becomes very handy.

scbrg · on March 6, 2019

What do you mean by "threadsafe" here? Could dict.items() actually break in Python 2? I've never seen that happen.

quietbritishjim · on March 6, 2019

As I admitted in my comment, I'm not 100% sure that it's not protected by the GIL. If it's not, I wouldn't expect a hard crash if you mutate from another thread while iterating, but more like e.g. an item doesn't appear in the even though a different one had been removed by the other thread. But as I said in my comment, even if it does happen to be protected by the GIL, I think it's unsafe and fragile to rely on it.

d33 · on March 6, 2019

Unexpected/undefined behavior?

varelaz · on March 6, 2019

There are a lot of cases when you don't need strict consistency and current state is enough for processing. For example you want to save requests stats from web servers. Would you stop all operations until you counting and writing to DB to be precise? Off course not. Some current number that you have is good enough for you. Off course you need to be aware of side effects.

d33 · on March 6, 2019

Wouldn't you otherwise risk a dangerous race condition?

Many languages I know do that, for example C#.

whiskeyalpha · on March 6, 2019

Does anyone know a large Python application that is iterators all the way down which is not subtly broken?

I have never seen one.