Discussion:
Drawing lessons from fatal SELinux bug #1054350
Kevin Kofler
2014-01-23 23:55:23 UTC
Permalink
Hi,

it is time to analyze the fallout from the following catastrophic Fedora 20
regression:
https://bugzilla.redhat.com/show_bug.cgi?id=1054350
"rpm scriptlets are exiting with status 127"

The impact:
* EVERYONE with Fedora 20 installed with SELinux enabled and in enforcing
mode, and who updated to the current stable updates, was hit by this bug.
* The bug completely breaks upgrading any package through both GUI and CLI
tools. Even the fix itself cannot be installed correctly.
* The only possible workaround requires use of the command line. It is
IMPOSSIBLE to fix this using GUI tools installed by default. The
system-config-selinux tool which can be used to fix this in a pure GUI
method is NOT installed by default in Fedora 20 for some stupid reason
(because somebody decided to make it as painful as possible to disable that
SELinux junk? Now I have to install system-config-selinux first thing post-
install just so I can disable the dreaded thing), and of course cannot be
installed after the fact because of the bug. Normal users do not use
terminals, so they can only reinstall Fedora or (more likely) a competing
distribution (or even operating system)!
* The only possible workaround also requires root access to the machine.
PolicyKit policy allows all users to install official updates by default,
but those users then cannot fix the breakage without bothering an
administrator.
* As per the above, there are several installations that can be considered
BRICKED.
* We are losing users to Ubuntu because of this issue. People are explicitly
saying they are switching to Ubuntu because of this bug (e.g.
https://bugzilla.redhat.com/show_bug.cgi?id=1054312#c5 , later confirmed:
https://bugzilla.redhat.com/show_bug.cgi?id=1054312#c10 ), and I am sure
there are many more who are silently doing it without telling us.
* The bug now has 38 (!) duplicates in Bugzilla, plus many complaints on
IRC, mailing lists, comments to other unrelated bugs (the fix for which
cannot be installed due to the SELinux bug) etc.

So it is time to draw some lessons from this issue to prevent such a bug
from ever occurring again!

So, what happened:
* We are shipping SELinux enabled (enforcing) by default, a tool designed to
prevent anything it does not like from happening. (Reread this carefully:
The ONLY thing that tool is designed to do at all is PREVENT things. It does
not have a SINGLE feature other than being a roadblock and an annoyance.)
* SELinux works by shipping a "policy" that effectively tries to specify in
one single place (read: single point of failure!) everything any program in
Fedora (scalability disaster!) ever wants to do (second-guessing its actual
code, i.e., duplication of all logic!). (Note the 3 (!) major antipatterns
in a single-sentence (!) description of how SELinux works!)
* An update to that SELinux policy was shipped that BREAKS the most critical
tools in Fedora, the ones required to update the system and thus install the
fixes for any regressions, including the very regression that caused the
breakage. And also any automated workarounds are blocked by design.
* That update made it out to the stable updates! In other words, the
draconian Update Policies that were enacted in a vain attempt to prevent
such issues from happening utterly failed at catching this bug.

Meanwhile, SELinux is also causing similarly fatal issues in Rawhide:
https://bugzilla.redhat.com/show_bug.cgi?id=1052317
"selinux-policy preventing login through sddm and ssh"
which are still NOT fixed! At least in that case, RPM is apparently not
affected, but if you cannot log in to your system (SDDM is the default
display manager for KDE in Rawhide), it is totally unusable ("bricked")!

So, what needs to happen:
* SELinux must be disabled (or preferably, not installed in the first place,
to avoid wasting space for nothing) by default! Just consider the benefits
(none!) vs. the risks (what you are seeing now: bricked systems in both F20
and Rawhide, the users switching to other distributions). If we want to have
any users left, SELinux needs to go away NOW!
* The Update Policies must be repealed. This regression has shown us that
not only did they totally fail at preventing it, but they are actively
contributing to exposing MORE users to broken updates by delaying regression
fixes. (This kind of regression fix needs to go out DIRECTLY to stable!)

Last time an issue like that happened (the D-Bus regression that broke
updates), a big drama was made that ultimately led to the (flawed) Update
Policies. And even a "catastrophe" that hit only a very small portion of our
users (those running the server part of bind) was used as a(n additional)
justification for the Update Policies, whereas this one now hits ALL users
who merely had the mishap of sticking to our flawed defaults (SELinux
enforcing). Why would we stick our heads in the sand this time?

DISABLE/DROP SELINUX NOW!

Thank you for your consideration,
Kevin Kofler

PS: I still recommend that ALL Fedora users disable SELinux immediately
after installing Fedora. That is the most effective way to avoid ever being
hit by catastrophic breakage such as bug #1054350 or bug #1052317. But we
should not ship with a broken default in the first place!
Adam Williamson
2014-01-24 00:02:40 UTC
Permalink
Hi,
"catastrophic Fedora 20 regression"
https://bugzilla.redhat.com/show_bug.cgi?id=1054350
"rpm scriptlets are exiting with status 127"
"EVERYONE"
"IMPOSSIBLE" to fix this using GUI tools installed by default. The
"some stupid reason (because somebody decided to make it as painful as possible to disable that
SELinux junk?"
"dreaded thing"
"(or even operating system)!"
"BRICKED."
"The ONLY thing that tool is designed to do at all is PREVENT things. It does
not have a SINGLE feature"
(read: single point of failure!)
(scalability disaster!)
duplication of all logic!)
(Note the 3 (!) major antipatterns in a single-sentence (!) description of how SELinux works!)
BREAKS
draconian Update Policies
vain attempt
utterly failed
default! Just consider the benefits (none!)
If we want to have any users left, SELinux needs to go away NOW!
The Update Policies must be repealed
totally failed
Why would we stick our heads in the sand this time?
DISABLE/DROP SELINUX NOW!
That's a great way to go about having a calm and reasoned discussion and
building consensus, Kevin.
--
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net
Eric Sandeen
2014-01-24 00:06:24 UTC
Permalink
Post by Kevin Kofler
* We are shipping SELinux enabled (enforcing) by default, a tool designed to
The ONLY thing that tool is designed to do at all is PREVENT things. It does
not have a SINGLE feature other than being a roadblock and an annoyance.)
In the same way that the lock on your front door is an annoyance, I guess.
Post by Kevin Kofler
* SELinux works by shipping a "policy" that effectively tries to specify in
one single place (read: single point of failure!) everything any program in
Fedora (scalability disaster!) ever wants to do (second-guessing its actual
code, i.e., duplication of all logic!). (Note the 3 (!) major antipatterns
in a single-sentence (!) description of how SELinux works!)
If you think SELinux is "duplicating all logic" in application code,
I do not think you quite grasp how SELinux works.

If the solution to every serious bug that slips through the cracks of a release
is to disable the package, over time we may not have much left in Fedora.

I know that pretty much all filesystems would be out by now. ;)

-Eric
Kevin Kofler
2014-01-24 00:37:08 UTC
Permalink
Post by Eric Sandeen
If the solution to every serious bug that slips through the cracks of a
release is to disable the package, over time we may not have much left in
Fedora.
But SELinux is the one package (OK, one of the few, along with, e.g.,
firewall stuff) whose removal would actually INCREASE Fedora's
functionality!

Kevin Kofler
Eric Sandeen
2014-01-24 00:50:23 UTC
Permalink
Post by Kevin Kofler
Post by Eric Sandeen
If the solution to every serious bug that slips through the cracks of a
release is to disable the package, over time we may not have much left in
Fedora.
But SELinux is the one package (OK, one of the few, along with, e.g.,
firewall stuff) whose removal would actually INCREASE Fedora's
functionality!
Sure, removing firewalls & selinux would be a serious enhancement
of functionality.

For malware botnets & spam hosts, especially...

-Eric
Kevin Kofler
2014-01-24 01:34:08 UTC
Permalink
Post by Eric Sandeen
Sure, removing firewalls & selinux would be a serious enhancement
of functionality.
For malware botnets & spam hosts, especially...
That would mean that all the distributions that do not enable SELinux (nor
AppArmor) by default are all owned by botnets, not to mention the many
people who disable those "features". Yet, the only machines that get hit are
those that have not been updated for months if not years (often running
ancient EOL distributions, but not even having the last updates provided for
those). SELinux is by no means necessary to protect your machine (especially
a firewalled non-server machine). The firewall can be of some use (and I'm
not advocating dropping that by default), though ideally we shouldn't have
servers trying to listen to non-local connections by default in the first
place!

Kevin Kofler
Adam Williamson
2014-01-24 01:38:16 UTC
Permalink
Post by Kevin Kofler
though ideally we shouldn't have
servers trying to listen to non-local connections by default in the first
place!
Doing so is specifically forbidden by policy, and the few packages that
do it had to seek a special exemption from FESCo to do so.
--
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net
David Beveridge
2014-01-24 01:37:47 UTC
Permalink
Post by Kevin Kofler
Post by Eric Sandeen
If the solution to every serious bug that slips through the cracks of a
release is to disable the package, over time we may not have much left in
Fedora.
But SELinux is the one package (OK, one of the few, along with, e.g.,
firewall stuff) whose removal would actually INCREASE Fedora's
functionality!
Increase for who? Unauthorised attackers?

SELinux is the primary reason for me and my company choosing
Fedora/Redhat over Ubuntu/Debian or other distributions in the first place.
Colin Walters
2014-01-24 03:43:26 UTC
Permalink
Post by Kevin Kofler
Last time an issue like that happened (the D-Bus regression that broke
updates),
That was my fault. Something which left an impact on me, you can be
sure. Like SELinux, DBus impacts nearly everything in early
userspace.

In fact, this particular regression is exactly one of the reasons I made
OSTree. With both this and the SELinux bug, knowing you can *always*
reboot into the previous system state and recover makes things
fundamentally better for a fast-moving system like Fedora.

Note OSTree is fully capable of *atomic* upgrades to SELinux - where
your running system is untouched, with the old policy. When you reboot,
you have the new policy, with the new daemon code.

With the RPM live updates model by default, you have old policy, until
rpm reloads it in the middle of a "transaction", restarts some daemons,
but not all of them, leaving you some *old* code with *new* policy -
something hard to test because you have to go out of your way to
reproduce it. You can't boot into that state directly.

Secondarily, on the server side, as many people have noted - this type
of thing can be caught by automated testing. It's not hardware
specific.

OSTree pairs extremely well with automated testing, because it allows
fast incremental updates for offline VMs. I've had this working for
about a year in the gnome-continuous context, and I can bring it to
rpm-ostree too. As in "a week or two".

So no, SELinux doesn't need to be disabled. We can make it much better
- if we step beyond the current philosophy of "test a package" to "test
many system states as atomic units".
Sérgio Basto
2014-01-24 04:18:17 UTC
Permalink
Post by Kevin Kofler
* SELinux must be disabled (or preferably, not installed in the first place,
to avoid wasting space for nothing) by default! Just consider the benefits
(none!) vs. the risks (what you are seeing now: bricked systems in both F20
and Rawhide, the users switching to other distributions). If we want to have
any users left, SELinux needs to go away NOW!
TBH: I always disable SELinux, and yes, I vote for SELinux not being
installed by default, though not necessarily for removing it.
Post by Kevin Kofler
* The Update Policies must be repealed. This regression has shown us that
not only did they totally fail at preventing it, but they are actively
contributing to exposing MORE users to broken updates by delaying regression
fixes. (This kind of regression fix needs to go out DIRECTLY to stable!)
I also agree: such critical packages should go directly to stable, and/or
we should be able to revoke them.

Best regards,
--
Sérgio M. B.
Adam Williamson
2014-01-24 04:20:57 UTC
Permalink
Post by Sérgio Basto
Post by Kevin Kofler
* The Update Policies must be repealed. This regression has shown us that
not only did they totally fail at preventing it, but they are actively
contributing to exposing MORE users to broken updates by delaying regression
fixes. (This kind of regression fix needs to go out DIRECTLY to stable!)
I also agree: such critical packages should go directly to stable, and/or
we should be able to revoke them.
TBH this has always been the one of Kevin's Big Book Of Update Policy
Complaints I find the most baffling. If we know you managed to screw up
your update once, why exactly would we just trust you to get it right
the *second* time without any testing?
--
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net
Sérgio Basto
2014-01-24 04:49:36 UTC
Permalink
Post by Adam Williamson
Post by Sérgio Basto
Post by Kevin Kofler
* The Update Policies must be repealed. This regression has shown us that
not only did they totally fail at preventing it, but they are actively
contributing to exposing MORE users to broken updates by delaying regression
fixes. (This kind of regression fix needs to go out DIRECTLY to stable!)
I also agree: such critical packages should go directly to stable, and/or
we should be able to revoke them.
TBH this has always been the one of Kevin's Big Book Of Update Policy
Complaints I find the most baffling. If we know you managed to screw up
your update once, why exactly would we just trust you to get it right
the *second* time without any testing?
Yeah, so revoking an update could be a better idea.
--
Sérgio M. B.
Rahul Sundaram
2014-01-24 06:35:08 UTC
Permalink
Hi
Post by Sérgio Basto
Yeah, so revoking an update could be a better idea.
How would that work? We don't control the mirrors. Mirrors can choose to
sync anytime they want to and retain packages we have removed.

Rahul
Adam Williamson
2014-01-24 06:58:58 UTC
Permalink
Post by Sérgio Basto
Hi
Yeah, so revoking an update could be a better idea.
How would that work? We don't control the mirrors. Mirrors can choose
to sync anytime they want to and retain packages we have removed.
Even if we can do it on the mirrors, we have no way to 'recall' a
package from systems where it's already been installed (of course in the
current case that wouldn't have worked anyway, but we're discussing the
generic case here).
--
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net
Kevin Kofler
2014-01-24 12:39:33 UTC
Permalink
Post by Adam Williamson
Even if we can do it on the mirrors, we have no way to 'recall' a
package from systems where it's already been installed (of course in the
current case that wouldn't have worked anyway, but we're discussing the
generic case here).
Crazy idea of the day: Maybe our update tools should default to distro-sync
rather than update? Together with ensuring timestamp monotonicity on the
metadata (don't accept older metadata if you already have newer one), it
would allow easily pulling faulty updates (except when RPM is broken as in
this case, of course) and could even render the dreaded Epoch hack obsolete.
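
The metadata half of this idea can be sketched roughly as follows. This is only a minimal illustration in Python, not how yum actually handles repository metadata: RepoMetadata and accept_metadata are hypothetical names, and the timestamp is assumed to be a plain integer published alongside the metadata.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RepoMetadata:
    """Hypothetical record for one repository's metadata."""
    repo_id: str
    timestamp: int  # assumed: seconds since the epoch, set by the repository


def accept_metadata(cached: Optional[RepoMetadata], offered: RepoMetadata) -> bool:
    """Accept offered metadata only if it is not older than what is cached.

    A stale mirror that still serves a pulled update would offer metadata
    with an older timestamp, and would therefore be rejected.
    """
    if cached is None:
        # Nothing cached yet, so there is nothing to compare against.
        return True
    return offered.timestamp >= cached.timestamp
```

With such a check in place, a client that has already seen the metadata in which a faulty update was pulled would refuse to go back to a mirror that still advertises it.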

Kevin Kofler
Ralf Corsepius
2014-01-24 14:55:57 UTC
Permalink
Post by Kevin Kofler
Post by Adam Williamson
Even if we can do it on the mirrors, we have no way to 'recall' a
package from systems where it's already been installed (of course in the
current case that wouldn't have worked anyway, but we're discussing the
generic case here).
Crazy idea of the day: Maybe our update tools should default to distro-sync
rather than update?
No, for 2 reasons:

a) This would blow away all installed packages, which aren't available
in permanently enabled repos.
The most common such case is having selectively installed packages from
updates-testing, because users are facing problems with these packages'
nominal versions.

b) A much more common class of packaging bug than the SELinux case is
packages which cannot be uninstalled or downgraded, or cannot be
downgraded properly. Classic cases are packages with defective
rpm scriptlets, or with scriptlets which perform persistent changes.

Ralf
Reindl Harald
2014-01-24 15:06:21 UTC
Permalink
Post by Kevin Kofler
Post by Adam Williamson
Even if we can do it on the mirrors, we have no way to 'recall' a
package from systems where it's already been installed (of course in the
current case that wouldn't have worked anyway, but we're discussing the
generic case here).
Crazy idea of the day: Maybe our update tools should default to distro-sync
rather than update?
a) This would blow away all installed packages, which aren't available in permanently enabled repos
that is not true, try it out

otherwise some packages would no longer be installed on my machines after a dist-upgrade,
namely the ones that never came from any repo and were installed locally
Most common such case is having selectively installed packages from updates-testing, because users are facing
problems with these packages' nominal versions
*that* is the reason not to do so because it would downgrade anything updated
explicitly from updates-testing,kde-testing,koji which would be a bad default

Ralf Corsepius
2014-01-24 15:40:01 UTC
Permalink
Post by Reindl Harald
Post by Kevin Kofler
Post by Adam Williamson
Even if we can do it on the mirrors, we have no way to 'recall' a
package from systems where it's already been installed (of course in the
current case that wouldn't have worked anyway, but we're discussing the
generic case here).
Crazy idea of the day: Maybe our update tools should default to distro-sync
rather than update?
a) This would blow away all installed packages, which aren't available in permanently enabled repos
that is not true, try it out
Been there many times.


Real world example with a package I maintain, which currently has an
update pending in updates-testing:


# yum install gumbo-parser
...
Installing : gumbo-parser-1.0-0.2.20131001gitd90ea2b.fc20.x86_64
...
[Note: updates-testing is disabled in
/etc/yum.repos.d/fedora-updates-testing.repo]


Now temporarily enable updates-testing to pull in the package from
updates-testing for testing:
# yum update --enablerepo=updates-testing gumbo-parser
...
Updating : gumbo-parser-1.0-0.2.20131204git87b99f2.fc20.x86_64
...


# yum distro-sync
...
Downgrading:
gumbo-parser x86_64
1.0-0.2.20131001gitd90ea2b.fc20 fedora
...
Removed:
gumbo-parser.x86_64 0:1.0-0.2.20131204git87b99f2.fc20



Installed:
gumbo-parser.x86_64 0:1.0-0.2.20131001gitd90ea2b.fc20
...
=>

qed


Ralf
Reindl Harald
2014-01-24 15:57:46 UTC
Permalink
Post by Ralf Corsepius
Post by Reindl Harald
a) This would blow away all installed packages, which aren't available in permanently enabled repos
that is not true, try it out
Been there many times
no, you did not and you did also not in your example below
Post by Ralf Corsepius
# yum distro-sync
...
gumbo-parser x86_64
1.0-0.2.20131001gitd90ea2b.fc20 fedora
...
gumbo-parser.x86_64 0:1.0-0.2.20131204git87b99f2.fc20
gumbo-parser.x86_64 0:1.0-0.2.20131001gitd90ea2b.fc20
nothing is blown away, you only did not read the output
because it was *downgraded* and *not* removed

this is *completely* different than "blown away"
this is what distro-sync *is supposed to do*
upgrade or downgrade any package which is in whatever current repo
but it *does not* blow away packages not in any repo at all
________________________________________

and if you had not stripped this paragraph from my original
reply, you might have looked twice
Post by Ralf Corsepius
Post by Reindl Harald
*that* is the reason not to do so because it would downgrade anything updated
explicitly from updates-testing,kde-testing,koji which would be a bad default
Ralf Corsepius
2014-01-24 16:12:31 UTC
Permalink
Post by Reindl Harald
Post by Ralf Corsepius
Post by Reindl Harald
a) This would blow away all installed packages, which aren't available in permanently enabled repos
that is not true, try it out
Been there many times
no, you did not and you did also not in your example below
Post by Ralf Corsepius
# yum distro-sync
...
gumbo-parser x86_64
1.0-0.2.20131001gitd90ea2b.fc20 fedora
...
gumbo-parser.x86_64 0:1.0-0.2.20131204git87b99f2.fc20
gumbo-parser.x86_64 0:1.0-0.2.20131001gitd90ea2b.fc20
nothing is blown away, you only did not read the output
because it was *downgraded* and *not* removed
Rubbish - Stop being childish.
Post by Reindl Harald
this is *completely* different than "blown away"
this is what distro-sync *is supposed to do*
upgrade or downgrade any package which is in whatever current repo
but it *does not* blow away packages not in any repo at all
If the package from updates-testing was fixing a critical bug on your
system, your system would be malfunctioning afterwards.
Reindl Harald
2014-01-24 16:31:56 UTC
Permalink
Post by Ralf Corsepius
Post by Reindl Harald
Post by Ralf Corsepius
Post by Reindl Harald
a) This would blow away all installed packages, which aren't available in permanently enabled repos
that is not true, try it out
Been there many times
no, you did not and you did also not in your example below
Post by Ralf Corsepius
# yum distro-sync
...
gumbo-parser x86_64
1.0-0.2.20131001gitd90ea2b.fc20 fedora
...
gumbo-parser.x86_64 0:1.0-0.2.20131204git87b99f2.fc20
gumbo-parser.x86_64 0:1.0-0.2.20131001gitd90ea2b.fc20
nothing is blown away, you only did not read the output
because it was *downgraded* and *not* removed
Rubbish - Stop being childish.
nobody here is childish, except maybe you
Post by Ralf Corsepius
Post by Reindl Harald
this is *completely* different than "blown away"
this is what distro-sync *is supposed to do*
upgrade or downgrade any package which is in whatever current repo
but it *does not* blow away packages not in any repo at all
If the package from updates-testing was fixing a critical bug on your system, your system would be
malfunctioning afterwards
and exactly *that* was what i said in my first reply, while you
stripped *exactly* that part out of your quote, most likely
because you replied by reflex without reading those 5 lines
completely

but that is *not* "a) This would blow away all installed packages, which aren't available in
permanently enabled repos" because that would mean *uninstalling* any package which is currently
not in an enabled repo - and that is *not* what distro-sync does

below *again* my complete reply, which is and was technically correct
while your "would blow away" is not

so next time, before you call others childish and reply to a message,
also read the second paragraph to avoid useless discussions

-------- Original Message --------
Subject: Re: Drawing lessons from fatal SELinux bug #1054350
Date: Fri, 24 Jan 2014 16:06:21 +0100
From: Reindl Harald <h.reindl at thelounge.net>
To: devel at lists.fedoraproject.org
Post by Ralf Corsepius
Post by Reindl Harald
Post by Ralf Corsepius
Even if we can do it on the mirrors, we have no way to 'recall' a
package from systems where it's already been installed (of course in the
current case that wouldn't have worked anyway, but we're discussing the
generic case here).
Crazy idea of the day: Maybe our update tools should default to distro-sync
rather than update?
a) This would blow away all installed packages, which aren't available in permanently enabled repos
that is not true, try it out

otherwise some packages would no longer be installed on my machines after a dist-upgrade,
namely the ones that never came from any repo and were installed locally
Post by Ralf Corsepius
Most common such case is having selectively installed packages from updates-testing, because users are facing
problems with these packages' nominal versions
*that* is the reason not to do so because it would downgrade anything updated
explicitly from updates-testing,kde-testing,koji which would be a bad default

Kevin Kofler
2014-01-25 16:35:42 UTC
Permalink
Ralf, Harald, you both actually mean the same thing, you're just
misunderstanding each other due to inexact wording!

Yes, distro-sync will not remove packages which are not in the default-
enabled repositories at all (in any version) (nor will it downgrade them,
obviously, because there is no version to downgrade them to).

And yes, distro-sync WILL downgrade packages if the new version is not (or
not yet) available in a default-enabled repository.
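
The two rules above can be written down as a small decision function. This is a rough sketch in Python under loud assumptions: versions are modelled as simple integer tuples rather than real RPM epoch:version-release strings, and distro_sync_action and plan are hypothetical names, not part of yum.

```python
from typing import Dict, Optional, Tuple

# Assumption: a tuple of integers stands in for an RPM version.
# Real RPM version comparison is considerably more involved.
Version = Tuple[int, ...]


def distro_sync_action(installed: Version, repo_latest: Optional[Version]) -> str:
    """Decide what a distro-sync style operation does with one package."""
    if repo_latest is None:
        # Not in any default-enabled repository, in any version:
        # neither removed nor downgraded.
        return "keep"
    if repo_latest > installed:
        return "upgrade"
    if repo_latest < installed:
        # Installed version is newer than anything the enabled repos offer,
        # e.g. it came from updates-testing.
        return "downgrade"
    return "keep"


def plan(installed: Dict[str, Version], repo: Dict[str, Version]) -> Dict[str, str]:
    """Map every installed package name to its distro-sync action."""
    return {name: distro_sync_action(ver, repo.get(name))
            for name, ver in installed.items()}
```

For instance, a package installed at snapshot (20131204,) while the enabled repositories only carry (20131001,) comes out as "downgrade", whereas a locally installed package missing from every repository comes out as "keep".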

It is clear that both of you know this, there was just a misunderstanding.

Kevin Kofler
Ralf Corsepius
2014-01-24 07:04:12 UTC
Permalink
Post by Sérgio Basto
Hi
Yeah, so revoking an update could be a better idea.
How would that work? We don't control the mirrors.
Fedora controls which mirrors are advertised (mirrorlists) through
mirrormanager. I.e., having mirrormanager advertise only mirrors which are
in sync should keep the risks fairly low.

Certainly, downgrading installations which already upgraded to faulty
packages would not work.

Ralf
Sergio Pascual
2014-01-24 09:58:22 UTC
Permalink
2014/1/24 Ralf Corsepius <rc040203 at freenet.de>
Post by Ralf Corsepius
Certainly, downgrading installations which already upgraded to faulty
packages would not work.
Ralf
The situation (a broken system that cannot be upgraded) could be mitigated
a little bit by using yum + system snapshots. You can rollback to a
previous sane system.

There is a plugin yum-plugin-fs-snapshot, but it requires better
documentation and system integration.

Currently (I don't know how current the F16 documentation is) it requires
running lvm by hand:

http://docs.fedoraproject.org/en-US/Fedora/16/html/System_Administrators_Guide/sec-Plugin_Descriptions.html

Sergio
Adam Williamson
2014-01-24 17:41:13 UTC
Permalink
Post by Sergio Pascual
2014/1/24 Ralf Corsepius <rc040203 at freenet.de>
Certainly, downgrading installations which already upgraded to
faulty packages would not work.
Ralf
The situation (a broken system that cannot be upgraded) could be
mitigated a little bit by using yum + system snapshots. You can
rollback to a previous sane system.
There is a plugin yum-plugin-fs-snapshot, but it requires better
documentation and system integration.
Currently (I don't know how current the F16 documentation is) it requires
running lvm by hand:
http://docs.fedoraproject.org/en-US/Fedora/16/html/System_Administrators_Guide/sec-Plugin_Descriptions.html
AIUI there is/was a long-term plan to integrate this as core
functionality using btrfs snapshots - in fact that was one of the major
attractions of the idea of switching to btrfs-by-default in the first
place. I believe those involved didn't think the LVM-based
implementation was clean/robust enough to use by default, but a
btrfs-based implementation would be. Do correct me if I'm wrong.
--
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net
Kevin Fenzi
2014-01-24 18:19:48 UTC
Permalink
On Fri, 24 Jan 2014 09:41:13 -0800
Post by Adam Williamson
AIUI there is/was a long-term plan to integrate this as core
functionality using btrfs snapshots - in fact that was one of the
major attractions of the idea of switching to btrfs-by-default in the
first place. I believe those involved didn't think the LVM-based
implementation was clean/robust enough to use by default, but a
btrfs-based implementation would be. Do correct me if I'm wrong.
I don't think snapshots are a particularly good solution, unless there's
some way to only roll back the rpm/yum transaction without also rolling
back unrelated changes.

kevin
Chris Murphy
2014-01-24 22:10:04 UTC
Permalink
Post by Kevin Fenzi
On Fri, 24 Jan 2014 09:41:13 -0800
Post by Adam Williamson
AIUI there is/was a long-term plan to integrate this as core
functionality using btrfs snapshots - in fact that was one of the
major attractions of the idea of switching to btrfs-by-default in the
first place. I believe those involved didn't think the LVM-based
implementation was clean/robust enough to use by default, but a
btrfs-based implementation would be. Do correct me if I'm wrong.
I don't think snapshots are a particularly good solution, unless there's
some way to only roll back the rpm/yum transaction without also rolling
back unrelated changes.
If there is a directory that contains update and non-update related file changes, that's a problem. If there's segmentation, then this can be done.

Clearly /home needs to be separate (it's OK to take a snapshot but just don't use it by default in a rollback) or we lose changes in /home in a rollback from the time of the snapshot to the time of the decision to rollback.

Another possible case is /etc/, where either a package or the user could make changes during the update. Btrfs allows per-file snapshots with cp --reflink, so there might be a way to carve the snapshot with a scalpel, but I prefer doing it with subvolume granularity. Plus, that granularity translates to LVM.



Chris Murphy
Kevin Kofler
2014-01-25 16:41:57 UTC
Permalink
Post by Chris Murphy
If there is a directory that contains update and non-update related file
changes, that's a problem. If there's segmentation, then this can be done.
Clearly /home needs to be separate (it's OK to take a snapshot but just
don't use it by default in a rollback) or we lose changes in /home in a
rollback from the time of the snapshot to the time of the decision to
rollback.
Another possible case is /etc/, where either a package or the user
could make changes during the update.
There's also /root, and then the most annoying case: /var. /var/lib/rpm
definitely needs to be rolled back, but you DON'T want to roll back things
such as log files in /var/log or systemwide databases (other than the RPM
database).

Kevin Kofler
Chris Murphy
2014-01-25 23:36:19 UTC
Permalink
Post by Kevin Kofler
Post by Chris Murphy
If there is a directory that contains update and non-update related file
changes, that's a problem. If there's segmentation, then this can be done.
Clearly /home needs to be separate (it's OK to take a snapshot but just
don't use it by default in a rollback) or we lose changes in /home in a
rollback from the time of the snapshot to the time of the decision to
rollback.
Another possible case is /etc/, where either a package or the user
could make changes during the update.
There's also /root, and then the most annoying case: /var. /var/lib/rpm
definitely needs to be rolled back, but you DON'T want to roll back things
such as log files in /var/log or systemwide databases (other than the RPM
database).
Well, it might be woefully ignorant of me to say, and seem like flamebaiting, but the mixing of such domains tells me that the FHS needs revision even outside of the context of snapshots. It's just that snapshots make it more obvious that the organization is deficient.

Another weird one for me is /var/lib/libvirt/images, which I certainly wouldn't want to snapshot (more specifically, I wouldn't want to roll it back by default in the face of a bad update).

Another way of dealing with this is many more subvolumes, so that some can be selectively snapshotted for rollbacks while others remain persistent. Again, it's fine to snapshot them at the same time too, but the default rollback behavior would only roll back the software updates.

For point of comparison, when choosing the default Btrfs layout, openSUSE's installer creates three partitions: swap, root (btrfs), home (btrfs). It then creates the following subvolumes on root:

boot/grub2/x86_64-efi
home
opt
srv
tmp
usr/local
var/crash
var/log
var/opt
var/spool
var/tmp

There's more snapshot granularity available with this setup.
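
A rough sketch, in Python, of how that granularity could drive a rollback decision. The subvolume list is copied from the openSUSE layout above. The function name and the rule "anything on its own subvolume survives a rollback of root" are assumptions made for illustration (a btrfs snapshot stops at nested subvolume boundaries, but what a rollback tool does with them is a design choice), not a description of any existing tool.

```python
from pathlib import PurePosixPath

# Subvolumes from the openSUSE layout quoted above (relative to /).
PERSISTENT_SUBVOLUMES = [
    "boot/grub2/x86_64-efi", "home", "opt", "srv", "tmp", "usr/local",
    "var/crash", "var/log", "var/opt", "var/spool", "var/tmp",
]

def survives_root_rollback(path: str) -> bool:
    """Return True if the absolute `path` lives on its own subvolume.

    Assumption: rolling back the root subvolume leaves nested
    subvolumes untouched, so only paths outside them are reverted.
    """
    parts = PurePosixPath(path).parts[1:]  # drop the leading "/"
    for sub in PERSISTENT_SUBVOLUMES:
        sub_parts = PurePosixPath(sub).parts
        if parts[:len(sub_parts)] == sub_parts:
            return True
    return False

for p in ["/var/log/messages", "/var/lib/rpm/Packages",
          "/usr/bin/rpm", "/home/alice/notes.txt"]:
    print(p, "kept" if survives_root_rollback(p) else "rolled back")
```

Under these assumptions /var/lib/rpm is rolled back while /var/log is kept, but /var/lib/libvirt/images would still be rolled back unless it got a subvolume of its own.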

Chris Murphy
Tomasz Torcz
2014-01-25 16:46:09 UTC
Permalink
Post by Chris Murphy
Post by Kevin Fenzi
On Fri, 24 Jan 2014 09:41:13 -0800
Post by Adam Williamson
AIUI there is/was a long-term plan to integrate this as core
functionality using btrfs snapshots - in fact that was one of the
major attractions of the idea of switching to btrfs-by-default in the
first place. I believe those involved didn't think the LVM-based
implementation was clean/robust enough to use by default, but a
btrfs-based implementation would be. Do correct me if I'm wrong.
I don't think snapshots are a particularly good solution, unless there's
some way to only roll back the rpm/yum transaction without also rolling
back unrelated changes.
If there is a directory that contains update and non-update related file
changes, that's a problem. If there's segmentation, then this can be done.
Clearly /home needs to be separate (it's OK to take a snapshot but just don't
use it by default in a rollback) or we lose changes in /home in a rollback from
the time of the snapshot to the time of the decision to rollback.
Another possible case is /etc/, where either a package or the user could
make changes during the update. Btrfs allows per file snapshots with cp
--reflink so there might be a way to carve the snapshot with a scalpel but I
prefer doing it with subvolume granularity. Plus that granularity translates to
LVM.
Note that this situation is perfectly handled by Offline Updates.
After reboot, there aren't collateral changes to filesystem, only upgrade-related
ones. So if there's a need for revert, the previous state is clearly defined.
--
Tomasz Torcz There exists no separation between gods and men:
xmpp: zdzichubg at chrome.pl one blends softly casual into the other.

Reindl Harald
2014-01-25 16:50:19 UTC
Permalink
Post by Tomasz Torcz
Note that this situation is perfectly handled by Offline Updates.
After reboot, there aren't collateral changes to filesystem, only upgrade-related
ones. So if there's a need for revert, the previous state is clearly defined
says who?

UsrMove, for example, was forced through with the excuse of supporting this,
as well as /usr on its own partition, because of one snapshot of the system


so and now imagine a common setup

* /boot
* /
* /var

have fun restoring your snapshot of / or /usr, where you bomb the rootfs back
and the rpmdb still looks as if the restore never happened


Simo Sorce
2014-01-25 19:55:32 UTC
Permalink
Post by Tomasz Torcz
Post by Chris Murphy
Post by Kevin Fenzi
On Fri, 24 Jan 2014 09:41:13 -0800
Post by Adam Williamson
AIUI there is/was a long-term plan to integrate this as core
functionality using btrfs snapshots - in fact that was one of the
major attractions of the idea of switching to btrfs-by-default in the
first place. I believe those involved didn't think the LVM-based
implementation was clean/robust enough to use by default, but a
btrfs-based implementation would be. Do correct me if I'm wrong.
I don't think snapshots are a particularly good solution, unless there's
some way to only roll back the rpm/yum transaction without also rolling
back unrelated changes.
If there is a directory that contains update and non-update related file
changes, that's a problem. If there's segmentation, then this can be done.
Clearly /home needs to be separate (it's OK to take a snapshot but just don't
use it by default in a rollback) or we lose changes in /home in a rollback from
the time of the snapshot to the time of the decision to rollback.
Another possible case is /etc/, where either a package or the user could
make changes during the update. Btrfs allows per file snapshots with cp
--reflink so there might be a way to carve the snapshot with a scalpel but I
prefer doing it with subvolume granularity. Plus that granularity translates to
LVM.
Note that this situation is perfectly handled by Offline Updates.
After reboot, there aren't collateral changes to filesystem, only upgrade-related
ones. So if there's a need for revert, the previous state is clearly defined.
Sorry, but this is simply not true.

I would really like to DISABUSE people of the notion that automated (or
not) rollbacks can be easily done in bulk, by the magic wand of file
system snapshots.

The ONLY way to do that is if you do not care at all about user's data
and simply accept that a rollback will also remove user data.

The reason is simple: lots of software *changes* data as part of its
normal functioning, including and often in rollback-incompatible ways.

You cannot assume that upgrading a program that uses a database X from
version A to B can still work if you keep database X unchanged and then
roll back from B to A. Lots of applications apply changes to the database at
upgrade time, either in the rpm scriptlets or automatically as soon as a
new version binary is run.

It is basically impossible to find applications that handle the case
where you downgrade, in any more graceful way than punting and failing
to start in the *good* case. In the bad case they start and trash the
database.

And by database, do not think SQL/NOSQL engines only, it can be any
simple dataset in a file, including configuration files in user's homes.

Simo.
--
Simo Sorce * Red Hat, Inc * New York
Colin Walters
2014-01-25 20:04:54 UTC
Permalink
Hi Simo,
Post by Simo Sorce
The reason is simple: lots of software *changes* data as part of its
normal functioning, including and often in rollback-incompatible ways.
I wrote and maintain a system that has been doing continuous deployment
of GNOME. It's been running for over a year, and is nearing its
10000th build.

I have "rolled" back many times - both on the server side, and on the
client side. Here's one I *just did* a few minutes ago because vala git
master broke the build of gnome-calculator:

https://git.gnome.org/browse/gnome-continuous/commit/?id=32a52e53100e92aad5d2dfae969be82227322f49

That's me telling the system "please stop building git master, and
freeze to this specific commit". All clients get that change when they
upgrade - OSTree cares not at all for version numbers.

The vala maintainers continue to work out the issue in git master, and I
continue to ship a working system. Double win.

Now it's true, programs in GNOME do sometimes make the type of data
format transition you're talking about. Evolution has done it at least
twice.

But you know what? My real world experience has been that having the
ability to roll back has *far* *far* *far* outweighed the downsides when
applications do format transitions. It's comparatively rare.

Far more people are bitten by things like hardware-specific issues where
gnome-shell fails to render on this particular card - and rollback works
beautifully for that.
Tomasz Torcz
2014-01-25 22:26:22 UTC
Permalink
Post by Simo Sorce
Post by Tomasz Torcz
Post by Chris Murphy
If there is a directory that contains update and non-update related file
changes, that's a problem. If there's segmentation, then this can be done.
Note that this situation is perfectly handled by Offline Updates.
After reboot, there aren't collateral changes to filesystem, only upgrade-related
ones. So if there's a need for revert, the previous state is clearly defined.
Sorry, but this is simply not true.
The ONLY way to do that is if you do not care at all about user's data
and simply accept that a rollback will also remove user data.
The reason is simple: lots of software *changes* data as part of its
normal functioning, including and often in rollback-incompatible ways.
What user data? There is no user data touched/created during offline upgrade.
--
Tomasz Torcz There exists no separation between gods and men:
xmpp: zdzichubg at chrome.pl one blends softly casual into the other.
Reindl Harald
2014-01-25 22:29:15 UTC
Permalink
Post by Tomasz Torcz
Post by Simo Sorce
The ONLY way to do that is if you do not care at all about user's data
and simply accept that a rollback will also remove user data.
The reason is simple: lots of software *changes* data as part of its
normal functioning, including and often in rollback-incompatible ways.
What user data? There is no user data touched/created during offline upgrade
and what is with the data *between* the upgrade and decision to roll back?

Adam Williamson
2014-01-25 23:12:38 UTC
Permalink
Post by Tomasz Torcz
Post by Simo Sorce
Post by Tomasz Torcz
Post by Chris Murphy
If there is a directory that contains update and non-update related file
changes, that's a problem. If there's segmentation, then this can be done.
Note that this situation is perfectly handled by Offline Updates.
After reboot, there aren't collateral changes to filesystem, only upgrade-related
ones. So if there's a need for revert, the previous state is clearly defined.
Sorry, but this is simply not true.
The ONLY way to do that is if you do not care at all about user's data
and simply accept that a rollback will also remove user data.
The reason is simple: lots of software *changes* data as part of its
normal functioning, including and often in rollback-incompatible ways.
What user data? There is no user data touched/created during offline upgrade.
No, but you may have to use the system somewhat before you can find out
there was a problem with the upgrade, and at *that* point, your user
data may now be tied to the new versions of system apps, as Simo
describes.

So, it goes like this:

* Do an offline update that includes Foo v2.0
* Boot the updated system, run Foo, it migrates its configuration to
some new scheme
* Realize there was something wrong with the update, roll it back
* Run Foo again, find it doesn't work because it's been migrated to the
new config scheme which the old version of Foo doesn't work with
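
The sequence above can be sketched in a few lines of Python. "Foo", its version numbers, and the schema table are all hypothetical, chosen only to mirror the steps in the list. The point is that the migration is one-way, and the old binary can at best refuse to start.

```python
# Hypothetical application "Foo": which config schema versions each
# release can read. Foo 2.0 reads schema 1 only to migrate it forward.
SUPPORTED = {"1.0": {1}, "2.0": {1, 2}}

def run_foo(app_version: str, config: dict) -> dict:
    """Start Foo: refuse configs that are too new, migrate old ones forward."""
    schema = config["schema"]
    if schema not in SUPPORTED[app_version]:
        raise RuntimeError(f"Foo {app_version} cannot read schema {schema}")
    newest = max(SUPPORTED[app_version])
    if schema < newest:
        config = {**config, "schema": newest}  # one-way migration
    return config

config = {"schema": 1, "theme": "dark"}
config = run_foo("2.0", config)   # after the update: migrated to schema 2
try:
    run_foo("1.0", config)        # after rolling back only the packages
except RuntimeError as err:
    print(err)
```

This models the "good" case, where the old version notices the mismatch and fails to start. An application without the version check would go on to read, and possibly corrupt, data it does not understand.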
--
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net
Chris Murphy
2014-01-25 23:42:13 UTC
Permalink
Post by Tomasz Torcz
Post by Chris Murphy
Another possible case is /etc/, where either a package or the user could
make changes during the update. Btrfs allows per file snapshots with cp
--reflink so there might be a way to carve the snapshot with a scalpel but I
prefer doing it with subvolume granularity. Plus that granularity translates to
LVM.
Note that this situation is perfectly handled by Offline Updates.
After reboot, there aren't collateral changes to filesystem, only upgrade-related
ones. So if there's a need for revert, the previous state is clearly defined.
I don't follow this. The realization an update is bad doesn't necessarily occur right away. So we still need a way to separate system domain vs user domain, at least, so that system files are rolled back separately from user files.


Chris Murphy

Chris Murphy
2014-01-24 21:38:04 UTC
Permalink
Post by Sergio Pascual
2014/1/24 Ralf Corsepius <rc040203 at freenet.de>
Certainly, downgrading installations which already upgraded to faulty packages would not work.
Ralf
The situation (a broken system that cannot be upgraded) could be mitigated a little bit by using yum + system snapshots. You can rollback to a previous sane system.
This is non-trivial for the typical user. And as far as I'm aware there's no storage SIG to define best practices, so at the moment there are 19 recipes per snapshot technology. So one person can explain how they do it, then the user gets into some trouble and has a question, and no other user can really answer it because they do it a different way.
Post by Sergio Pascual
There is a plugin yum-plugin-fs-snapshot, but it requires better documentation and system integration.
Well I'd go a step further and ask some more basic questions: how many snapshots should be bootable, whether systemd-journal should be persistent across snapshots or snapshot specific, what exactly are we snapshotting, can we require /home be separate (presently we don't require it) in order to support such bootable snapshots, on and on.

I'm not sure whether the plugin needs an update, a replacement, or a separate user space program that helps manage the moving parts: the snapshot creation, the update, altering fstab and grub.cfg as needed, and even what the bootable snapshot options should look like and where. We have three kernels to choose from in GRUB. As soon as there's one snapshot, we might have three identical kernel versions, each with two sysroots; or possibly there are four kernels, where one only boots the old sysroot, one only boots the new sysroot, and two could boot either sysroot. And that's just with one snapshot. As soon as there are accumulating snapshots to boot, I start looking for blood because I don't have enough going to my brain as it is.
Post by Sergio Pascual
Currently (I don't know how current is F16 documentation) it requires running lvm by hand
http://docs.fedoraproject.org/en-US/Fedora/16/html/System_Administrators_Guide/sec-Plugin_Descriptions.html
Right, and there's a legitimate question about how useful conventional LVM snapshots are. At one time they were the only thing we had, but they're slow and inefficient, and you wouldn't want to use them for very long, as that's not what they were designed for. LVM thinp snapshots, by contrast, are completely different: usable, accumulable over the long haul, and do not require preallocation, very much like Btrfs snapshots but of course a different implementation. And then Btrfs snapshots are dead nuts simple to create and remove compared to thinp - *and* they are directly grub2 bootable, unlike LVM thinp.


Chris Murphy

Josh Stone
2014-01-24 23:16:52 UTC
Permalink
On Jan 24, 2014, at 2:58 AM, Sergio Pascual <sergio.pasra at gmail.com
Post by Sergio Pascual
There is a plugin yum-plugin-fs-snapshot, but it requires better
documentation and system integration.
Well I'd go a step further and ask some more basic questions: how
many snapshots should be bootable, whether systemd-journal should be
persistent across snapshots or snapshot specific, what exactly are we
snapshotting, can we require /home be separate (presently we don't
require it) in order to support such bootable snapshots, on and on.
I'd also ask where we keep these snapshots, and how do you prevent
access to them normally. IIRC, yum-plugin-fs-snapshot makes btrfs
snapshots as a subvolume directly within the filesystem, which means it
will still be accessible.

This concerns me especially in the case of security updates -- for
example, a vulnerable setuid-root binary should be locked up tight!

Josh
Chris Murphy
2014-01-25 01:27:13 UTC
Permalink
Post by Josh Stone
On Jan 24, 2014, at 2:58 AM, Sergio Pascual <sergio.pasra at gmail.com
Post by Sergio Pascual
There is a plugin yum-plugin-fs-snapshot, but it requires better
documentation and system integration.
Well I'd go a step further and ask some more basic questions: how
many snapshots should be bootable, whether systemd-journal should be
persistent across snapshots or snapshot specific, what exactly are we
snapshotting, can we require /home be separate (presently we don't
require it) in order to support such bootable snapshots, on and on.
I'd also ask where we keep these snapshots, and how do you prevent
access to them normally. IIRC, yum-plugin-fs-snapshot makes btrfs
snapshots as a subvolume directly within the filesystem, which means it
will still be accessible.
It's possible the utility could mount another subvolume not in the present path, including top level ID 5 (the first and default subvolume created by mkfs.btrfs), place snapshots there, and then unmount.

For LVM thinp snapshots, they become completely out of tree LVs. They aren't accessible unless explicitly mounted.
Post by Josh Stone
This concerns me especially in the case of security updates -- for
example, a vulnerable setuid-root binary should be locked up tight!
The organization question is valid. But sudo or root could just mount any subvolume. However, btrfs read-only snapshots can't be written to even by root. Naturally root could just create a rw snapshot of a ro snapshot and then delete the ro snapshot, but an audit probably ought to show the subvolume UUIDs and creation dates involved so that we'd know this is what happened.


Chris Murphy
Josh Stone
2014-01-25 04:40:28 UTC
Permalink
Post by Chris Murphy
Post by Josh Stone
This concerns me especially in the case of security updates -- for
example, a vulnerable setuid-root binary should be locked up tight!
The organization question is valid. But sudo or root could just mount
any subvolume. However, btrfs read-only snapshots can't be written to
even by root. Naturally root could just create a rw snapshot of a ro
snapshot and then delete the ro snapshot, but an audit probably ought
to show the subvolume UUIDs and creation dates involved so that we'd
know this is what happened.
My point was not about what root can do. Suppose there's a vulnerable
'sudo' binary that gives everyone a root shell. If that binary is
available on any executable path, even readonly, that's trouble.

As you say, LVM snapshots are out of view, but with btrfs it needs to be
an inaccessible subvolume path, or mounted noexec, etc.
Bruno Wolff III
2014-01-25 14:03:52 UTC
Permalink
On Fri, Jan 24, 2014 at 20:40:28 -0800,
Post by Josh Stone
My point was not about what root can do. Suppose there's a vulnerable
'sudo' binary that gives everyone a root shell. If that binary is
available on any executable path, even readonly, that's trouble.
That isn't true. File systems can be mounted such that suid bits are
ignored. suid executables on such file systems are effectively just
normal executables.
Josh Stone
2014-01-25 18:37:48 UTC
Permalink
Post by Bruno Wolff III
On Fri, Jan 24, 2014 at 20:40:28 -0800,
Post by Josh Stone
My point was not about what root can do. Suppose there's a vulnerable
'sudo' binary that gives everyone a root shell. If that binary is
available on any executable path, even readonly, that's trouble.
That isn't true. File systems can be mounted such that suid bits are
ignored. suid executables on such file systems are effectively just
normal executables.
Ok, sure, you can mount -o nosuid,noexec,nodev ... but this isn't the
default for btrfs subvolume paths AFAIK. It needs to be a conscious
decision in whatever snapshot design we choose.
Colin Walters
2014-01-25 19:32:11 UTC
Permalink
Post by Josh Stone
Ok, sure, you can mount -o nosuid,noexec,nodev ... but this isn't the
default for btrfs subvolume paths AFAIK. It needs to be a conscious
decision in whatever snapshot design we choose.
This is definitely an issue with the OSTree design, since everything
shares a physical partition (you can choose whatever block storage you
want) - it's just hard links.

I just filed:
https://bugzilla.gnome.org/show_bug.cgi?id=722984
for this.

But really, now that KDBus is on the way, we can start using it for
system services to replace many setuid binaries, like unix_chkpwd
without losing the auditing trail and such that old indirection via
dbus-daemon required. That's a subject for a different thread though.
Simo Sorce
2014-01-25 20:05:34 UTC
Permalink
Post by Colin Walters
Post by Josh Stone
Ok, sure, you can mount -o nosuid,noexec,nodev ... but this isn't the
default for btrfs subvolume paths AFAIK. It needs to be a conscious
decision in whatever snapshot design we choose.
This is definitely an issue with the OSTree design, since everything
shares a physical partition (you can choose whatever block storage you
want) - it's just hard links.
https://bugzilla.gnome.org/show_bug.cgi?id=722984
for this.
I forgot my gnome bugzilla password (again) so before I forget:
do not use .files or such it quickly becomes a mess. If you need to
annotate this kind of things I humbly suggest you add an xattr to the
file namespaced to ostree.

Alternatively, if you do not want to touch the original file at all,
keep a separate database where you note all these things, it will make
for a faster lookup in case you need bulk operations instead of having
to troll the whole tree.
Post by Colin Walters
But really, now that KDBus is on the way, we can start using it for
system services to replace many setuid binaries, like unix_chkpwd
without losing the auditing trail and such that old indirection via
dbus-daemon required. That's a subject for a different thread though.
This is a good point, but a number of binaries are that way for legacy
reasons, or come from upstreams that care for portability and can't rely
on dbus (yet), so I think you need to care for the problem anyway.

Simo.
--
Simo Sorce * Red Hat, Inc * New York
Chris Murphy
2014-01-25 20:15:28 UTC
Permalink
Post by Josh Stone
Post by Chris Murphy
Post by Josh Stone
This concerns me especially in the case of security updates -- for
example, a vulnerable setuid-root binary should be locked up tight!
The organization question is valid. But sudo or root could just mount
any subvolume. However, btrfs read-only snapshots can't be written to
even by root. Naturally root could just create a rw snapshot of a ro
snapshot and then delete the ro snapshot, but an audit probably ought
to show the subvolume UUIDs and creation dates involved so that we'd
know this is what happened.
My point was not about what root can do. Suppose there's a vulnerable
'sudo' binary that gives everyone a root shell. If that binary is
available on any executable path, even readonly, that's trouble.
OK, so is the fact it's persistently available the problem? Because if I were to have a persistent backup of sysroot mounted, I've got the same attack vector available. By default, even for an unprivileged user, gnome-shell mounts volumes with rw,nosuid,nodev,relatime,seclabel,uhelper=udisks2.
Post by Josh Stone
As you say, LVM snapshots are out of view, but with btrfs it needs to be
an inaccessible subvolume path, or mounted noexec, etc.
To make inaccessible: mount a subvol outside of the presently mounted path, snapshot, umount.

Seems like I can independently mount subvolumes with noexec:

49 37 0:45 /isos /mnt/isos rw,relatime shared:35 - btrfs /dev/sdb rw,seclabel,compress=lzo,space_cache
177 37 0:45 /archive /mnt/root rw,noexec,relatime shared:159 - btrfs /dev/sdb rw,seclabel,compress=lzo,space_cache

So another possibility is to have a "snapshots" subvolume persistently mounted, with noexec, and always place snapshots in that subvolume.
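
A check along these lines could be automated. The Python sketch below parses the two mountinfo lines shown above and lists btrfs mount points that are not mounted noexec. The function names are made up for illustration, and it only inspects the per-mount options (the sixth field), not the superblock options after the " - " separator.

```python
def parse_mountinfo_line(line: str) -> dict:
    """Split one /proc/self/mountinfo line into the fields used below."""
    left, right = line.split(" - ", 1)   # " - " separates the optional fields
    fields = left.split()
    fstype, source = right.split()[:2]
    return {
        "root": fields[3],               # subvolume path within the filesystem
        "mount_point": fields[4],
        "options": set(fields[5].split(",")),
        "fstype": fstype,
        "source": source,
    }

SAMPLE = """\
49 37 0:45 /isos /mnt/isos rw,relatime shared:35 - btrfs /dev/sdb rw,seclabel,compress=lzo,space_cache
177 37 0:45 /archive /mnt/root rw,noexec,relatime shared:159 - btrfs /dev/sdb rw,seclabel,compress=lzo,space_cache
"""

def exec_allowed(mountinfo: str) -> list:
    """Return btrfs mount points that are not mounted noexec."""
    mounts = [parse_mountinfo_line(l) for l in mountinfo.splitlines() if l.strip()]
    return [m["mount_point"] for m in mounts
            if m["fstype"] == "btrfs" and "noexec" not in m["options"]]

print(exec_allowed(SAMPLE))
```

On a live system the same function could be fed the contents of /proc/self/mountinfo instead of SAMPLE. Checking for nosuid as well would be a one-line change.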



Chris Murphy
Kevin Kofler
2014-01-25 16:45:16 UTC
Permalink
Post by Sergio Pascual
The situation (a broken system that cannot be upgraded) could be
mitigated a little bit by using yum + system snapshots. You can rollback
to a previous sane system.
The big problem with that approach (other than the granularity issue already
pointed out) is disk space. Even with a smart snapshotting technology that
really only keeps on disk exactly what changed, it still requires a lot of
extra disk space.

Kevin Kofler
Kevin Kofler
2014-01-24 12:36:22 UTC
Permalink
Post by Adam Williamson
TBH this has always been the one of Kevin's Big Book Of Update Policy
Complaints I find the most baffling. If we know you managed to screw up
your update once, why exactly would we just trust you to get it right
the *second* time without any testing?
* If the package is already so screwed that it breaks the whole system, it
cannot realistically get any worse.
* A regression fix is usually a trivial change, often reverting something to
a previous, already well-tested, state.

Kevin Kofler
Adam Williamson
2014-01-24 17:42:28 UTC
Permalink
Post by Kevin Kofler
Post by Adam Williamson
TBH this has always been the one of Kevin's Big Book Of Update Policy
Complaints I find the most baffling. If we know you managed to screw up
your update once, why exactly would we just trust you to get it right
the *second* time without any testing?
* If the package is already so screwed that it breaks the whole system, it
cannot realistically get any worse.
Sure it can. It can wipe all your data, or mail it to the NSA...

"Breaks the whole system" is high on the Pantscon Scale, sure, but it's
not the highest. Data loss and security compromise both come higher.
Post by Kevin Kofler
* A regression fix is usually a trivial change, often reverting something to
a previous, already well-tested, state.
Sure. And what could possibly go wrong.
--
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net
Kevin Kofler
2014-01-25 16:56:33 UTC
Permalink
Post by Adam Williamson
Post by Kevin Kofler
* If the package is already so screwed that it breaks the whole system,
it cannot realistically get any worse.
Sure it can. It can wipe all your data, or mail it to the NSA...
That's why I said "realistically". What is the probability of something like
that happening IN PRACTICE from a trivial change, especially one that
reverts something to a known previous state? Perfect QA does not exist
anyway, so it is all a matter of probabilities.
Post by Adam Williamson
"Breaks the whole system" is high on the Pantscon Scale, sure, but it's
not the highest. Data loss and security compromise both come higher.
And you seriously think a week (or 2) of testing can catch that? Security
vulnerabilities can take years to get noticed, creeping data loss days to
weeks. The only thing in your catastrophe scale that can be noticed in 1-2
weeks is blatant "wipes entire directories immediately" data loss, something
extremely unlikely to happen from a regression fix (but so are the other
catastrophic scenarios, unless the problem was already there before the
regression fix and is thus no reason to withhold the fix).
Post by Adam Williamson
Post by Kevin Kofler
* A regression fix is usually a trivial change, often reverting something
to a previous, already well-tested, state.
Sure. And what could possibly go wrong.
The risk of a catastrophe as you described happening needs to be estimated
(a tiny fraction of a percent) and weighed against the breakage (and denial
of service, if security terms are the only ones you understand!) of keeping
the broken update in stable for longer (and thus also letting it affect more
users). I think it is blatantly obvious that the first theoretical risk is
the much better one to take compared to the second, very practical one.

Kevin Kofler
drago01
2014-01-24 10:54:24 UTC
Permalink
Post by Kevin Kofler
* We are enabling SELinux enabled (enforcing) by default, a tool designed to
The ONLY thing that tool is designed to do at all is PREVENT things. It does
not have a SINGLE feature other than being a roadblock and an annoyance.)
The "feature" is called security. By your logic everyone should be
root, we should
disable other security features like ASLR and NX (both PREVENT me from running
malicious code but do not add a SINGLE feature).

So please read on how security is implemented and why.
Post by Kevin Kofler
* SELinux works by shipping a "policy" that effectively tries to specify in
one single place (read: single point of failure!) everything any program in
Fedora (scalability disaster!) ever wants to do (second-guessing its actual
code, i.e., duplication of all logic!).
That's not how it works, nor how it is supposed to work. Please read up on MAC.
Post by Kevin Kofler
(Note the 3 (!) major antipatterns
in a single-sentence (!) description of how SELinux works!)
That is not a description of how it works, but your misunderstanding of it.
Post by Kevin Kofler
* An update to that SELinux policy was shipped that BREAKS the most critical
tools in Fedora, the ones required to update the system and thus install the
fixes for any regressions, including the very regression that caused the
breakage. And also any automated workarounds are blocked by design.
No idea what "automated workaround" means but there are other ways to
deal with it see Colin's post.
Post by Kevin Kofler
* That update made it out to the stable updates! In other words, the
draconian Update Policies that were enacted in a vain attempt to prevent
such issues from happening utterly failed at catching this bug.
Yeah, so we should find out why this happened and improve the testing
procedures to not let it happen in the future (again, see Colin's mail).
Post by Kevin Kofler
* SELinux must be disabled (or preferably, not installed in the first place,
to avoid wasting space for nothing) by default! Just consider the benefits
(none!)
As stated above that's not true.
Post by Kevin Kofler
* The Update Policies must be repealed. This regression has shown us that
not only they totally failed at preventing it, but they are actively
contributing to exposing MORE users to broken updates by delaying regression
fixes. (This kind of regression fixes needs to go out DIRECTLY to stable!)
This is a contradiction: "our current testing didn't find the bug, so
how about we do no testing at all".
Kevin Kofler
2014-01-24 12:56:55 UTC
Post by drago01
The "feature" is called security. By your logic everyone should be
root,
For home user machines, that wouldn't necessarily be a bad thing (but it
would mean fixing the software that special-cases the root user improperly
for no good reason).

Alternatively, the kernel could be patched to give "admin users" (either
defined as members of the "wheel" group as now, or by some additional
property that would be set for the same users by default) some strategic
capabilities such as dac_override. That would also put an end to the endless
annoyance of having to sudo all the time. (And by the way, sudo and
PolicyKit actions should be allowed with no password (rather than the user
password as now) for wheel group members by default.) That way, you still
get the benefits from different accounts, e.g., different preferences per
family member, without the current restrictions imposed on "normal" users.

The endless password prompts make a lot of sense in controlled corporate
environments with dedicated system administrators, but on home machines,
they are just an unnecessary annoyance.
Post by drago01
Post by Kevin Kofler
* SELinux works by shipping a "policy" that effectively tries to specify
in one single place (read: single point of failure!) everything any
program in Fedora (scalability disaster!) ever wants to do
(second-guessing its actual code, i.e., duplication of all logic!).
That's not how it works, nor how it is supposed to work. Please read up on MAC.
Uh, I know how it works. The above is how I summarize it. If you think that
is incorrect, please explain HOW.
Post by drago01
Post by Kevin Kofler
* An update to that SELinux policy was shipped that BREAKS the most
critical tools in Fedora, the ones required to update the system and thus
install the fixes for any regressions, including the very regression that
caused the breakage. And also any automated workarounds are blocked by
design.
No idea what "automated workaround" means, but there are other ways to
deal with it; see Colin's post.
A %pretrans scriptlet that fixes the problem without manual user
intervention (other than OKing the update). But SELinux won't allow RPMs to
mess with it that way (especially without invoking an external executable,
which is blocked by the faulty policy) because it would defeat its flawed
security model.
Post by drago01
Yeah, so we should find out why this happened and improve the testing
procedures so it does not happen in the future (again, see Colin's mail).
NO amount of testing is going to prevent regressions from happening
occasionally. This means:
* we need to eliminate common sources of regressions such as SELinux, to
prevent whole classes of regressions from occurring in the first place
(prevention is better than duct tape!) and
* we have to accept that regressions can always happen and allow for fast
fixes to those regressions (direct stable pushes).
Post by drago01
Post by Kevin Kofler
* SELinux must be disabled (or preferably, not installed in the first
place, to avoid wasting space for nothing) by default! Just consider the
benefits (none!)
As stated above that's not true.
As stated above, that IS true. :-)
Post by drago01
Post by Kevin Kofler
* The Update Policies must be repealed. This regression has shown us that
not only they totally failed at preventing it, but they are actively
contributing to exposing MORE users to broken updates by delaying
regression fixes. (This kind of regression fixes needs to go out DIRECTLY
to stable!)
This is a contradiction: "our current testing didn't find the bug, so
how about we do no testing at all".
There is no contradiction. Doing away with policies that do not work is
perfectly logical, as is allowing quick regression fixes because regressions
do happen no matter how much you test.

Kevin Kofler
Reindl Harald
2014-01-24 13:40:30 UTC
Post by Kevin Kofler
Alternatively, the kernel could be patched to give "admin users" (either
defined as members of the "wheel" group as now, or by some additional
property that would be set for the same users by default) some strategic
capabilities such as dac_override. That would also put an end to the endless
annoyance of having to sudo all the time. (And by the way, sudo and
PolicyKit actions should be allowed with no password (rather than the user
password as now) for wheel group members by default.) That way, you still
get the benefits from different accounts, e.g., different preferences per
family member, without the current restrictions imposed on "normal" users.
The endless password prompts make a lot of sense in controlled corporate
environments with dedicated system administrators, but on home machines,
they are just an unnecessary annoyance
no, they are not; they are there for the same reason firefox asks
for the master password before displaying stored passwords, even
after you have already entered it to log in somewhere

if you are not alone, they prevent unauthorized people from doing
unwanted things on the machine while you go to the toilet and
forget to lock your screen

what you propose is the Apple way - not on a linux system please


Simo Sorce
2014-01-24 14:25:05 UTC
Post by Reindl Harald
Post by Kevin Kofler
Alternatively, the kernel could be patched to give "admin users" (either
defined as members of the "wheel" group as now, or by some additional
property that would be set for the same users by default) some strategic
capabilities such as dac_override. That would also put an end to the endless
annoyance of having to sudo all the time. (And by the way, sudo and
PolicyKit actions should be allowed with no password (rather than the user
password as now) for wheel group members by default.) That way, you still
get the benefits from different accounts, e.g., different preferences per
family member, without the current restrictions imposed on "normal" users.
The endless password prompts make a lot of sense in controlled corporate
environments with dedicated system administrators, but on home machines,
they are just an unnecessary annoyance
no, they are not; they are there for the same reason firefox asks
for the master password before displaying stored passwords, even
after you have already entered it to log in somewhere
if you are not alone, they prevent unauthorized people from doing
unwanted things on the machine while you go to the toilet and
forget to lock your screen
Worse than that, they prevent automated attacks via very vulnerable
applications like browsers. [which of course in Kevin's world are never
run in a SELinux sandbox]

So if you get some malware to jailbreak out of the browser sandbox,
all it needs to do is "sudo pwnme" if there is no password request.

Of course you need to understand at least a smidgen of security to avoid
proposing ludicrous 'defaults'.
Post by Reindl Harald
what you propose is the Apple way - not on a linux system please
It is just 'the pwn me' way, nothing more, nothing less.
--
Simo Sorce * Red Hat, Inc * New York
Fabian Deutsch
2014-01-24 18:12:52 UTC
Post by Kevin Kofler
it is time to analyze the fallout from the following catastrophic Fedora 20
https://bugzilla.redhat.com/show_bug.cgi?id=1054350
"rpm scriptlets are exiting with status 127"
Hey,

can't we add a default boot entry which starts the system in permissive
mode?

- fabian
drago01
2014-01-24 18:18:20 UTC
Post by Fabian Deutsch
Post by Kevin Kofler
it is time to analyze the fallout from the following catastrophic Fedora 20
https://bugzilla.redhat.com/show_bug.cgi?id=1054350
"rpm scriptlets are exiting with status 127"
Hey,
can't we add a default boot entry which starts the system in permissive
mode?
How would that help? If a user knows enough about the issue to try it
he/she could just switch to permissive mode.
Reindl Harald
2014-01-24 18:31:33 UTC
Post by drago01
Post by Fabian Deutsch
Post by Kevin Kofler
it is time to analyze the fallout from the following catastrophic Fedora 20
https://bugzilla.redhat.com/show_bug.cgi?id=1054350
"rpm scriptlets are exiting with status 127"
Hey,
can't we add a default boot entry which starts the system in permissive
mode?
How would that help? If a user knows enough about the issue to try it
he/she could just switch to permissive mode
in *that* case

in a case where a broken selinux update means the system does not boot at all,
i can not imagine what i would do besides booting from a CD/DVD/USB

Reindl Harald
2014-01-24 18:35:17 UTC
Post by Reindl Harald
Post by drago01
Post by Fabian Deutsch
Post by Kevin Kofler
it is time to analyze the fallout from the following catastrophic Fedora 20
https://bugzilla.redhat.com/show_bug.cgi?id=1054350
"rpm scriptlets are exiting with status 127"
Hey,
can't we add a default boot entry which starts the system in permissive
mode?
How would that help? If a user knows enough about the issue to try it
he/she could just switch to permissive mode
in *that* case
in a case where a broken selinux update means the system does not boot at all,
i can not imagine what i would do besides booting from a CD/DVD/USB
to be clear - *i can* edit the boot-params and put selinux=0 there

the average user can't, but he may remember "uhm, something with selinux
was one of the last updates" and try the option, whatever it is called. keep
in mind some people own only one machine and can't google for help

Daniel J Walsh
2014-01-24 19:22:13 UTC
Post by Reindl Harald
Post by Reindl Harald
Post by drago01
Post by Fabian Deutsch
Post by Kevin Kofler
it is time to analyze the fallout from the following catastrophic
https://bugzilla.redhat.com/show_bug.cgi?id=1054350 "rpm scriptlets
are exiting with status 127"
Hey,
can't we add a default boot entry which starts the system in
permissive mode?
How would that help? If a user knows enough about the issue to try it
he/she could just switch to permissive mode
in *that* case
in a case where a broken selinux update means the system does not boot at all, i can
not imagine what i would do besides booting from a CD/DVD/USB
to be clear - *i can* edit the boot-params and put selinux=0 there
the average user can't, but he may remember "uhm, something with selinux was
one of the last updates" and try the option, whatever it is called. keep in mind
some people own only one machine and can't google for help
enforcing=0 in the kernel command line will boot the machine in permissive mode.
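
For reference, the two boot parameters mentioned in this subthread behave differently: selinux=0 turns SELinux off entirely, while enforcing=0 leaves it loaded but permissive. What follows is only a minimal Python sketch of that distinction, not code from any Fedora tool. The function name selinux_boot_mode and the return labels are invented for illustration, and the sketch assumes that a later parameter overrides an earlier one.

```python
def selinux_boot_mode(cmdline: str) -> str:
    """Return the SELinux mode implied by a kernel command line string."""
    mode = "config"      # no override given: /etc/selinux/config decides
    disabled = False
    for token in cmdline.split():
        if token == "selinux=0":
            disabled = True          # SELinux is switched off for this boot
        elif token == "selinux=1":
            disabled = False
        elif token == "enforcing=0":
            mode = "permissive"      # denials are logged but not enforced
        elif token == "enforcing=1":
            mode = "enforcing"
    return "disabled" if disabled else mode


# Example command line as it might look after editing the boot entry.
print(selinux_boot_mode("ro rhgb quiet enforcing=0"))
```

With neither parameter present the result is "config", meaning the mode comes from the configuration file rather than from the boot entry.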
Reindl Harald
2014-01-24 19:27:02 UTC
Post by Daniel J Walsh
Post by Reindl Harald
Post by Reindl Harald
Post by drago01
Post by Fabian Deutsch
Post by Kevin Kofler
it is time to analyze the fallout from the following catastrophic
https://bugzilla.redhat.com/show_bug.cgi?id=1054350 "rpm scriptlets
are exiting with status 127"
Hey,
can't we add a default boot entry which starts the system in
permissive mode?
How would that help? If a user knows enough about the issue to try it
he/she could just switch to permissive mode
in *that* case
in a case where a broken selinux update means the system does not boot at all, i can
not imagine what i would do besides booting from a CD/DVD/USB
to be clear - *i can* edit the boot-params and put selinux=0 there
the average user can't, but he may remember "uhm, something with selinux was
one of the last updates" and try the option, whatever it is called. keep in mind
some people own only one machine and can't google for help
enforcing=0 in the kernel command line will boot the machine in permissive mode
please re-read what you have quoted and don't skip "average user" this time

the question was "can't we add a default boot entry which starts the system in
permissive mode?" and the first reply "If a user knows enough about the issue"

drago01
2014-01-24 20:13:50 UTC
Post by Reindl Harald
Post by Reindl Harald
Post by drago01
Post by Fabian Deutsch
Post by Kevin Kofler
it is time to analyze the fallout from the following catastrophic Fedora 20
https://bugzilla.redhat.com/show_bug.cgi?id=1054350
"rpm scriptlets are exiting with status 127"
Hey,
can't we add a default boot entry which starts the system in permissive
mode?
How would that help? If a user knows enough about the issue to try it
he/she could just switch to permissive mode
in *that* case
in a case where a broken selinux update means the system does not boot at all,
i can not imagine what i would do besides booting from a CD/DVD/USB
to be clear - *i can* edit the boot-params and put selinux=0 there
the average user can't, but he may remember "uhm, something with selinux
was one of the last updates"
You are assuming that the "average user" even knows what selinux is
or reviews the list of packages for every update.
I doubt either of them is true.
Post by Reindl Harald
and try the option, whatever it is called. keep
in mind some people own only one machine and can't google for help
I doubt that. Most people do have multiple ways to access the internet
(multiple computers, tablets, phones, game consoles ...) it is 2014
not 1996.
Reindl Harald
2014-01-24 20:23:46 UTC
Post by drago01
Post by Reindl Harald
Post by Reindl Harald
Post by drago01
Post by Fabian Deutsch
Post by Kevin Kofler
it is time to analyze the fallout from the following catastrophic Fedora 20
https://bugzilla.redhat.com/show_bug.cgi?id=1054350
"rpm scriptlets are exiting with status 127"
Hey,
can't we add a default boot entry which starts the system in permissive
mode?
How would that help? If a user knows enough about the issue to try it
he/she could just switch to permissive mode
in *that* case
in a case where a broken selinux update means the system does not boot at all,
i can not imagine what i would do besides booting from a CD/DVD/USB
to be clear - *i can* edit the boot-params and put selinux=0 there
the average user can't, but he may remember "uhm, something with selinux
was one of the last updates"
You are assuming that the "average user" even knows what selinux is
or reviews the list of packages for every update.
I doubt either of them is true.
as i said often:

linux systems also tend to get way too closed

many things are hidden on the assumption that "the user does not want to be
disturbed with this and that information, and install as well as boot need to
be pretty and shiny"

* rhgb
* quiet
* hidden grub-menu

hence, while you install Fedora there should be a (default enabled) checkbox
asking if you want to enable SELinux, with a short description of what it is
Post by drago01
Post by Reindl Harald
and try the option, whatever it is called. keep
in mind some people own only one machine and can't google for help
I doubt that. Most people do have multiple ways to access the internet
(multiple computers, tablets, phones, game consoles ...) it is 2014
not 1996
technically yes

in practice, how much fun is it to google on a smartphone for a
solution while you are frustrated and angry? and even then, do not
assume that all users live in your social structure; that is not
really true

for me it is no problem. on the other hand, there is a guy on the CentOS
list with the thread "died again" hunting for a hardware problem and
stating he has no money to replace his 10 or so years old computer, while
you and i would have thrown that crap out of the nearest window weeks ago

Ian Malone
2014-01-25 03:13:23 UTC
Post by drago01
Post by Reindl Harald
and try the option, whatever it is called. keep
in mind some people own only one machine and can't google for help
I doubt that. Most people do have multiple ways to access the internet
(multiple computers, tablets, phones, game consoles ...) it is 2014
not 1996.
Most well-off people in Europe and North America. Even then it's not
necessarily particularly convenient, and if net access on a machine
with a problem isn't available then you have further issues if files
are needed.
Besides, even given I have a laptop and a desktop machine, if I'm
running Fedora on both and an update breaks something in a way that
isn't immediately obvious then both could get hit. And looking up
technical instructions on a phone is pretty tedious.
--
imalone
http://ibmalone.blogspot.co.uk
Fabian Deutsch
2014-01-24 18:37:42 UTC
Post by drago01
Post by Fabian Deutsch
Post by Kevin Kofler
it is time to analyze the fallout from the following catastrophic Fedora 20
https://bugzilla.redhat.com/show_bug.cgi?id=1054350
"rpm scriptlets are exiting with status 127"
Hey,
can't we add a default boot entry which starts the system in permissive
mode?
How would that help? If a user knows enough about the issue to try it
he/she could just switch to permissive mode.
I mean, don't we have a general "safe boot" / "emergency boot" entry -
we could add enforcing=0 there.

- fabian
Konstantin Ryabitsev
2014-01-24 18:53:10 UTC
Post by Fabian Deutsch
Post by drago01
How would that help? If a user knows enough about the issue to try it
he/she could just switch to permissive mode.
I mean, don't we have a general "safe boot" / "emergency boot" entry -
we could add enforcing=0 there.
I like this idea. Then the solution would have been "reboot into
rescue mode and update your system".

Let's do this.
--
Konstantin Ryabitsev
LinuxFoundation.org
Montréal, Québec
Matthew Miller
2014-01-24 19:26:55 UTC
Post by Konstantin Ryabitsev
Post by Fabian Deutsch
I mean, don't we have a general "safe boot" / "emergency boot" entry -
we could add enforcing=0 there.
I like this idea. Then the solution would have been "reboot into
rescue mode and update your system".
Let's do this.
https://bugzilla.redhat.com/show_bug.cgi?id=1057768
--
Matthew Miller -- Fedora Project -- <mattdm at fedoraproject.org>
Heiko Adams
2014-01-24 19:32:42 UTC
Post by drago01
Post by Fabian Deutsch
Post by Kevin Kofler
it is time to analyze the fallout from the following catastrophic Fedora 20
https://bugzilla.redhat.com/show_bug.cgi?id=1054350
"rpm scriptlets are exiting with status 127"
Hey,
can't we add a default boot entry which starts the system in permissive
mode?
How would that help? If a user knows enough about the issue to try it
he/she could just switch to permissive mode.
Having the ability to revoke stable updates, and a way to handle automatic
downgrades of revoked updates including temporarily switching SELinux to
permissive mode, would IMHO be a better solution for the case where a buggy
update went to stable and the system is still up and running. That
way the user has nothing more to do than run a new update check.
--
Regards,

Heiko Adams

Michael Schwendt
2014-01-24 18:26:24 UTC
Post by Kevin Kofler
it is time to analyze the fallout from the following catastrophic Fedora 20
https://bugzilla.redhat.com/show_bug.cgi?id=1054350
"rpm scriptlets are exiting with status 127"
* We are losing users to Ubuntu because of this issue.
Always hard to comment on without having seen numbers/statistics first.

IMO, there's some sort of "pork cycle" related to users switching
distributions back and forth after a couple of releases. Around Fedora 18,
users of Ubuntu returned. The current Anaconda is not everyone's
cup of tea, however, so that's one point where users are lost already.
Post by Kevin Kofler
* The bug now has 38 (!) duplicates in Bugzilla, plus many complaints on
IRC, mailing lists, comments to other unrelated bugs (the fix for which
cannot be installed due to the SELinux bug) etc.
As I've searched for and closed several of the dupes, the various bug
reports are interesting. The users have misidentified the problem as
being a problem in the package/update they wanted to install. Some have
noticed many packages failing and have blamed Yum. I wonder how many
users are affected by the problem and have not done anything yet (since,
for example, they expect an update to fix it "magically").
Post by Kevin Kofler
* That update made it out to the stable updates! In other words, the
draconian Update Policies that were enacted in a vain attempt to prevent
such issues from happening utterly failed at catching this bug.
Those policies are not "draconian" enough [1]. On the erroneous belief that
a +1 from three different testers would mean that the update has seen
enough testing, the test update was published with the default karma
threshold of +3. The testers have failed. It's too easy for testers to
rush through the voting in bodhi without testing the updates
painstakingly. "The faster the better" has led to a fatal mistake in
this case.

[1] It's up to the package maintainers to disable karma automatism or
to increase the threshold. AFAIK, the selinux maintainers are open to
doing exactly that.
Adam Williamson
2014-01-24 19:14:50 UTC
Post by Michael Schwendt
Post by Kevin Kofler
* That update made it out to the stable updates! In other words, the
draconian Update Policies that were enacted in a vain attempt to prevent
such issues from happening utterly failed at catching this bug.
Those policies are not "draconian" enough [1]. On the erroneous belief that
a +1 from three different testers would mean that the update has seen
enough testing, the test update was published with the default karma
threshold of +3. The testers have failed. It's too easy for testers to
rush through the voting in bodhi without testing the updates
painstakingly. "The faster the better" has led to a fatal mistake in
this case.
I think that's being unnecessarily harsh on the testers. It's not at all
obvious to anyone that you ought to test update/install of another
package in order to validate an update to selinux-policy-targeted .
Hell, I don't do that.

Hate to sound like a broken record, but really the problem here is just
the complete lack of granularity in the karma system: to phrase it
theoretically, we know there is a huge spectrum of meanings for both +1
and -1:

+1
--

* I installed it and nothing blew up
* I installed it, rebooted and nothing blew up
* I installed it, ran the entire test suite, grabbed the source tarball
and inspected it line-by-line for vulnerabilities, fuzz tested all the
variable handling, then deployed it to my extensive test farm for a week
and assessed the results
* It fixes my bug, and I didn't test anything else
* It fixes my bug, and nothing blew up
* It fixes my bug, and...(you see where I'm going with this)
* It installs, it works, maybe it fixes some bugs, but it also
introduces this other regression
* I like the update text / the update submitter / candy

-1
--

* It failed to install
* I installed it, and something blew up
* I installed it, rebooted and something blew up
* (etc)
* It doesn't fix my bug (and that's the only bug the update was meant to
fix)
* It doesn't fix my bug (but the update also fixes 50 other bugs,
successfully)
* It doesn't fix this other bug I have that the update didn't even claim
to fix
* It installs, it works, maybe it fixes some bugs, but it also
introduces this other new bug (yes. this is identical to one of the +1
entries. That is the point. The same thing can also be registered as 0,
giving us the perfect set. Depending on the details of what's fixed and
what's broken, and the individual karma submitter's instincts, it can
seem 'right' to file this as any one of the three possible values.)
* It installs, it works, it doesn't exactly introduce any bugs, but I
think it is not compliant with the update policy (i.e. too drastic a
change in behaviour from the previous package)
* I don't like the update text / the update submitter / candy

The 'comment' field exists to allow people to express all these things,
but as it's just a completely free-form text field, it's intrinsically
impossible to really base any programmatic stuff or even policy on it.
In theory maintainers could submit updates without using autokarma and
then keep a careful eye on the feedback and 'tend' their updates
manually, but I think it's pretty clear that in practice, this is not
what happens: maintainers really want to be able to use the karma system
as a 'helper', they want to farm out the evaluation process to Bodhi/the
karma system. But our current system is too stupid to handle this
perfectly, so we get these breakdowns.

With a more flexible karma system we have a *lot* of opportunity to do
much cleverer stuff. We can provide presets for all the above different
things that are currently commonly expressed via +1 or -1 with a
comment. This opens up possibilities at two different levels: the distro
policy level, and the packager level. We can make the distro policy much
more fine-grained, if we want to - we can require certain of the 'karma
types' to be available in all updates, and for instance, block any
update where X people pull the 'it's completely busted' or 'it
introduces a security vulnerability' cord, regardless of how much
broadly-categorized 'positive' karma it has. At the packager level, the
packager gets the freedom to define a much more fine-grained policy for
when they're happy that updates to their package are 'good to go', but
they still don't have to sit there reading the emails and manually
interpreting what people have written. You get to define the policy that
makes the most sense for your package, within the confines of the
distro-wide policy - if you have a good package-specific test suite, you
can say to the auto-karma system 'don't send this update out until at
least one person sets the "I ran the test suite and it passed" karma
property'.
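
To make the contrast concrete, here is a rough Python sketch of the idea, assuming a typed-feedback scheme along the lines described above. It is not how Bodhi works or would have to work: the feedback type names, the Policy fields and both function names are hypothetical, chosen only to show a veto and a packager-defined requirement sitting on top of the usual threshold.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical feedback types for illustration; none of these exist in Bodhi.
POSITIVE = {"installs_ok", "fixes_my_bug", "test_suite_passed"}
NEGATIVE = {"fails_to_install", "does_not_fix_my_bug", "new_regression"}
BLOCKERS = {"completely_busted", "security_vulnerability"}


def net_karma(feedback):
    """Collapse every report to +1 or -1, as a plain karma sum does."""
    score = 0
    for kind in feedback:
        if kind in POSITIVE:
            score += 1
        elif kind in NEGATIVE or kind in BLOCKERS:
            score -= 1
    return score


@dataclass(frozen=True)
class Policy:
    threshold: int = 3                 # net karma needed before pushing
    blocker_limit: int = 1             # this many blocker reports veto the push
    required: frozenset = frozenset()  # types the packager insists on seeing


def ready_to_push(feedback, policy):
    """Apply vetoes and requirements first, then the usual threshold."""
    counts = Counter(feedback)
    if sum(counts[kind] for kind in BLOCKERS) >= policy.blocker_limit:
        return False
    if any(counts[kind] == 0 for kind in policy.required):
        return False
    return net_karma(feedback) >= policy.threshold


reports = ["fixes_my_bug"] * 4 + ["completely_busted"]
print(net_karma(reports) >= 3)           # plain threshold check
print(ready_to_push(reports, Policy()))  # typed check with a veto
```

With these made-up reports the plain sum still reaches +3, so the first check passes, while the typed check refuses the push because of the single "completely busted" report.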

Those are just examples: the point is that what we badly need here is a
more expressive and flexible system. (As well, as I've said elsewhere in
the discussion, as a good automated test for this specific and
well-known category of 'delayed action' update problems).
--
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net
Dominick Grift
2014-01-24 20:06:29 UTC
Post by Adam Williamson
I think that's being unnecessarily harsh on the testers. It's not at all
obvious to anyone that you ought to test update/install of another
package in order to validate an update to selinux-policy-targeted .
Hell, I don't do that.
Agreed. The testers did not fail. Their issues were solved. They could
not have found this issue in reason. There was no change log entry for
it, and even if there was, they would still need to be able to trace
the bug to SELinux.

The commit that caused this grief was meant for rawhide, but it ended up
in the wrong branch (too many branches, too high a volume of fixes, too much
rushing; routine can be a killer).

Only the people that know this matter could have prevented it in reason,
anything else would have been just luck in my opinion.

In my view there are two issues that need to be addressed.

- Take your time, think twice before you commit and double check what
you commit. Do not let routine get the upper hand. Give all your changes
the attention they deserve.

- Coordinate with your team mates. Proofread each other's commits (if
only by skimming over them)

If a team proofreads each other's commits then this could have been
prevented in reason. Sure, it is a structural investment, but it pays off
and it's not as bad as it sounds. Five minutes daily?
Michael Schwendt
2014-01-24 20:38:14 UTC
Post by Dominick Grift
Agreed, The testers did not fail. Their issues were solved.
That doesn't match what one can read here:

https://admin.fedoraproject.org/updates/FEDORA-2014-0806/selinux-policy-3.12.1-116.fc20
Post by Dominick Grift
They could not have found this issue in reason.
Why not? Please explain.
Post by Dominick Grift
There was no change log entry for it,
You make it sound as if the testers have tried to skim over the several
dozen bugzilla ticket descriptions linked at
https://admin.fedoraproject.org/updates/FEDORA-2014-0806/selinux-policy-3.12.1-116.fc20
in an attempt at trying to find out _what_ the update touches.

A fundamental problem here is that even if a tester confirms that the
update fixes a _single_ bug, the other several dozens of changes could
cause regression -> reason to be careful and test this thing a bit longer.
Post by Dominick Grift
and even if there was, they would still need to be able to trace
the bug to SELinux.
That has been easy once the update arrived here on the nearby mirror.
"setenforce 0 && repeat previous command that caused strange behaviour"
is a very common troubleshooting thing, even if there haven't been any
AVC denied messages.
Dominick Grift
2014-01-24 21:17:20 UTC
Post by Michael Schwendt
Post by Dominick Grift
Agreed, The testers did not fail. Their issues were solved.
https://admin.fedoraproject.org/updates/FEDORA-2014-0806/selinux-policy-3.12.1-116.fc20
I just had a quick look at the above URL. Of all the testers, there was
one guy who noticed the anomaly, and most of the reported events
weren't even related to the RPM issue

The RPM issue did not cause the normal AVC (type=AVC) denials (AFAIK)
that one would expect. Instead there were some SELINUX_ERR events
(type=SELINUX_ERR) that one might not notice if one is looking for AVC
(type=AVC) denials. (not sure if setroubleshoot would have reported
those)
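
As a rough illustration of that point, the Python sketch below counts audit records by their leading type= field, so that SELINUX_ERR records show up next to AVC denials instead of being overlooked. It is only a sketch: the sample lines are invented and merely shaped like audit log records, the function name is made up, and on a real system one would feed it the lines of the audit log (usually /var/log/audit/audit.log, which is an assumption here).

```python
import re
from collections import Counter

# Matches the leading "type=NAME" field of an audit record.
RECORD_TYPE = re.compile(r"^type=(\w+)\b")

# Record types worth looking at when SELinux is suspected.
SELINUX_TYPES = {"AVC", "SELINUX_ERR"}


def count_selinux_records(lines):
    """Count SELinux-related audit records by record type."""
    counts = Counter()
    for line in lines:
        match = RECORD_TYPE.match(line)
        if match and match.group(1) in SELINUX_TYPES:
            counts[match.group(1)] += 1
    return counts


# Invented sample records, shaped like audit log lines but not real output.
SAMPLE = [
    "type=SELINUX_ERR msg=audit(1390000000.100:1): (illustrative error text)",
    "type=SYSCALL msg=audit(1390000000.100:1): (illustrative syscall record)",
    "type=AVC msg=audit(1390000001.200:2): avc:  denied  { read } (illustrative)",
    "type=SELINUX_ERR msg=audit(1390000002.300:3): (illustrative error text)",
]

counts = count_selinux_records(SAMPLE)
print(counts["AVC"], counts["SELINUX_ERR"])
```

Someone filtering only on type=AVC would see one record in this sample and miss the two SELINUX_ERR ones.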

The person that did notice the anomaly did some thorough testing, and
maybe there was also a little bit of luck involved there
Post by Michael Schwendt
Post by Dominick Grift
They could not have found this issue in reason.
Why not? Please explain.
Because you would need to run RPM to notice it, and then be able to
correlate the issue to SELinux. If you are waiting for a package that
has your fixes, then you test your issues and give karma. It may take a
while before one actually runs yum again, and by then the update may
have ended up in the repository.

And this is just in the case of RPM. There can be bugs in policy for
many components, but those are often not fatal.
Post by Michael Schwendt
Post by Dominick Grift
There was no change log entry for it,
You make it sound as if the testers have tried to skim over the several
dozen bugzilla ticket descriptions linked at
https://admin.fedoraproject.org/updates/FEDORA-2014-0806/selinux-policy-3.12.1-116.fc20
in an attempt at trying to find out _what_ the update touches.
A fundamental problem here is that even if a tester confirms that the
update fixes a _single_ bug, the other several dozens of changes could
cause regression -> reason to be careful and test this thing a bit longer.
Sure, what i am saying is that this could have been prevented if the
team just put a little more passion into it and also did some proof
reading/coordination. The team knows what's going on. They know the
issues and they can quickly and effortlessly identify issues like these
if only they would take some time to watch each other's commits.
Post by Michael Schwendt
Post by Dominick Grift
and even if there was they would still need to be able to trace
the bug to SELinux.
That has been easy once the update arrived here on the nearby mirror.
"setenforce 0 && repeat previous command that caused strange behaviour"
is a very common troubleshooting thing, even if there haven't been any
AVC denied messages.
If it was as common as you make it sound then maybe it might not have
come this far. It did. Again, one would have first had to identify the
issue (e.g. run RPM). There was no indication of any change related to
RPM (no change log entry).

But sure i give you that, yes thorough testing could have also prevented
this. (i still think it's pretty unlikely but it could so i will take
that back)

Nevertheless, I think this issue could have been prevented even before
a package was spun.
Michael Schwendt
2014-01-24 21:54:27 UTC
Permalink
Post by Dominick Grift
Post by Michael Schwendt
https://admin.fedoraproject.org/updates/FEDORA-2014-0806/selinux-policy-3.12.1-116.fc20
Because you would need to run RPM to notice it,
Or Yum, DNF, Yumex, PackageKit, all tools on top of RPM would run into the
scriptlet errors. ;) Provided that you get a chance to evaluate the
installed test update for some time and the vote won't be too late.
Post by Dominick Grift
Post by Michael Schwendt
That has been easy once the update arrived here on the nearby mirror.
"setenforce 0 && repeat previous command that caused strange behaviour"
is a very common troubleshooting thing, even if there haven't been any
AVC denied messages.
If it was as common as you make it sound then maybe it might not have
come this far.
Well, as mentioned before, this test update had been marked stable and
pushed into the updates repo already before appearing in updates-testing
on more mirrors. Worse if some testers fetch packages from koji and vote
in bodhi too early. By the time the first testers noticed the scriptlet
errors it was too late, since stable updates cannot be withdrawn.
Post by Dominick Grift
It did. Again, one would have first had to identify the
issue (e.g. run RPM). There was no indication of any change related to
RPM (no change log entry).
Unconvincing. A similar thing has been prevented in a Yum Test Update some
weeks ago only because some _more_ testers have _not_ voted +1 before
actually using the updated Yum for some time. That is a lesson to
learn. Watch the votes:
https://admin.fedoraproject.org/updates/FEDORA-2013-22706/yum-3.4.3-119.fc20
Dominick Grift
2014-01-24 22:05:42 UTC
Permalink
Post by Michael Schwendt
Post by Dominick Grift
Post by Michael Schwendt
https://admin.fedoraproject.org/updates/FEDORA-2014-0806/selinux-policy-3.12.1-116.fc20
Because you would need to run RPM to notice it,
Or Yum, DNF, Yumex, PackageKit, all tools on top of RPM would run into the
scriptlet errors. ;) Provided that you get a chance to evaluate the
installed test update for some time and the vote won't be too late.
You're right, i suppose there is room for improvement there as well,
just like there is room for improvement when it comes to how the
maintainers work.

Thanks for reading, i just wanted to shine a little light on the other
end of the story, for the record.
Kevin Kofler
2014-01-25 18:17:14 UTC
Permalink
By the time the first testers noticed the scriptlet errors it was too
late, since stable updates cannot be withdrawn.
That is also not a law of Physics. In the early days of Bodhi, one could
actually unpush stuff from stable. Having stable updates become immutable is
purely a policy decision. Withdrawing faulty updates has been done in the
past (even after Bodhi stopped allowing it in the normal case; the pulling
has then been done by an admin) and should be done again. Of course it won't
fix the systems that already got upgraded, but it will (within mirroring
delays) stop MORE systems from getting affected (and those that did already
get the faulty update won't notice the difference, unless they distro-sync,
in which case withdrawing the update actually fixes them, so in no case does
it make things worse for them).

And I don't see any valid reason why stable updates cannot simply be
withdrawn or sent back to testing by the maintainer. The update notes should
also remain editable, so that bug references can be added when the bug was
only found to be fixed after the stable push, errors in the update
description can be fixed, etc.

Kevin Kofler
Michael Schwendt
2014-01-25 19:28:43 UTC
Permalink
Post by Kevin Kofler
By the time the first testers noticed the scriptlet errors it was too
late, since stable updates cannot be withdrawn.
That is also not a law of Physics. In the early days of Bodhi, one could
actually unpush stuff from stable.
Pointing that out doesn't make a difference. Obviously, I don't refer
to technical constraints. Even before bodhi, e.g., the Fedora Extras signers
could modify the master repo in an emergency situation.
Post by Kevin Kofler
Having stable updates become immutable is purely a policy decision.
Sure.
Post by Kevin Kofler
Withdrawing faulty updates has been done in the
past (even after Bodhi stopped allowing it in the normal case; the pulling
has then been done by an admin) and should be done again. Of course it won't
fix the systems that already got upgraded, but it will (within mirroring
delays) stop MORE systems from getting affected (and those that did already
get the faulty update won't notice the difference, unless they distro-sync,
in which case withdrawing the update actually fixes them, so in no case does
it make things worse for them).
Not sure that can be generalised. Distro-sync may downgrade packages.
We don't test downgrades of packages (scriptlets e.g.), and we don't test
downgrades of software either. We can't be sure downgraded software can
restore state at runtime after a previous upgrade may have touched
(= converted, renamed or replaced) config files or database files.
Downgrades could also affect dependencies and may make it necessary
to have a system update tool run distro-sync automatically. There are
enough users already, who play too much with --skip-broken instead of
reporting uninstallable updates/packages quickly.
Kevin Kofler
2014-01-25 18:10:16 UTC
Permalink
Post by Dominick Grift
Sure, what i am saying is that this could have been prevented if the
team just put a little more passion into it and also did some proof
reading/coordination. The team knows what's going on. They know the
issues and they can quickly and effortlessly identify issues like these
if only they would take some time to watch each other's commits.
Looking at the history of the involved bugs, using manual pushes rather than
the broken karma automatism and taking into account Bugzilla comments, not
just Bodhi comments, would probably also have prevented this fiasco. One of
the bugs (not the one that ended up becoming the canonical bug, but an
earlier one) was reassigned to selinux-policy fairly quickly.

One of the major flaws in the Bodhi karma system is that it cannot possibly
see what happens in Bugzilla.
Post by Dominick Grift
Nevertheless, I think this issue could have been prevented even before
a package was spun.
Yes, by disabling SELinux by default. :-)

Kevin Kofler
Dominick Grift
2014-01-25 19:00:54 UTC
Permalink
Post by Kevin Kofler
Post by Dominick Grift
Nevertheless, I think this issue could have been prevented even before
a package was spun.
Yes, by disabling SELinux by default. :-)
No, that is a different discussion. Disabling SELinux does nothing to
solve this. If anything, to me this is confirmation of why we need a
good SELinux implementation. If this would happen to any other component
then a good SELinux implementation could have contained the damage
caused by issues just like this one.

The SELinux experience can, in my view be improved, and i believe your
problem is not with SELinux itself but with how it is
configured/implemented by default.

I just believe that a little team coordination, and a little more care
can go a long way, and that that is likely more efficient than trying to
create tests that would catch all of the bugs which sounds like utopia
to me.

I am not saying that the tests can't be improved or that they should not
be improved. It's just that in this case a little bit more care and a
double check by another involved party would probably have prevented this,
and similar other issues, in my view.
Kevin Kofler
2014-01-25 20:51:54 UTC
Permalink
Post by Dominick Grift
No, that is a different discussion.
Nonsense. That SELinux should be disabled is the whole point of this thread
(I know, I have started it!), all the suggestions (in the various
subthreads) of how to paper over the problem are off topic.
Post by Dominick Grift
Disabling SELinux does nothing to solve this.
Oh sure it does. It eliminates this whole class of breakage (critical
components unable to do their job because SELinux gets in their way) once
and for all. This type of breakage keeps occurring, in fact one instance is
still ongoing in Rawhide while we're discussing this:
https://bugzilla.redhat.com/show_bug.cgi?id=1052317

Disable SELinux and nothing will keep those components (RPM, display
managers, etc.) from doing their work.
Post by Dominick Grift
If anything, to me this is confirmation of why we need a good SELinux
implementation. If this would happen to any other component then a good
SELinux implementation could have contained the damage caused by issues
just like this one.
You don't seem to understand at all what SELinux is (it is not a tool
magically able to fix bugs, its only purpose is to keep programs from doing
their work)

Post by Dominick Grift
The SELinux experience can, in my view be improved, and i believe your
problem is not with SELinux itself but with how it is
configured/implemented by default.

 nor what it can or cannot do. (No amount of configuration can make SELinux
do anything other than block (i.e. break) things.)
Post by Dominick Grift
I just believe that a little team coordination, and a little more care
can go a long way, and that that is likely more efficient than trying to
create tests that would catch all of the bugs which sounds like utopia
to me.
What "coordination"?
Post by Dominick Grift
I am not saying that the tests can't be improved or that they should not
be improved. It's just that in this case a little bit more care and a
double check by another involved party would probably have prevented this,
and similar other issues, in my view.
Sure, dropping autokarma could help (and should be done in any case, that
Bodhi "feature" never made any sense), but ultimately there's no way around
disabling SELinux. Enabling it by default in Fedora has always been a
mistake!

Kevin Kofler
Dominick Grift
2014-01-25 21:08:29 UTC
Permalink
Post by Kevin Kofler
Post by Dominick Grift
No, that is a different discussion.
Nonsense. That SELinux should be disabled is the whole point of this thread
(I know, I have started it!), all the suggestions (in the various
subthreads) of how to paper over the problem are off topic.
Sorry, I must have misinterpreted the subject: "Drawing lessons from
fatal SELinux bug"
Post by Kevin Kofler
Post by Dominick Grift
Disabling SELinux does nothing to solve this.
Oh sure it does. It eliminates this whole class of breakage (critical
components unable to do their job because SELinux gets in their way) once
and for all. This type of breakage keeps occurring, in fact one instance is
still ongoing in Rawhide while we're discussing this:
https://bugzilla.redhat.com/show_bug.cgi?id=1052317
Disable SELinux and nothing will keep those components (RPM, display
managers, etc.) from doing their work.
Post by Dominick Grift
If anything, to me this is confirmation of why we need a good SELinux
implementation. If this would happen to any other component then a good
SELinux implementation could have contained the damage caused by issues
just like this one.
You don't seem to understand at all what SELinux is (it is not a tool
magically able to fix bugs, its only purpose is to keep programs from doing
their work)

I did not mean to suggest that. I meant to suggest that SELinux would be
able to contain the damage, referring to "fatal" in: "Drawing lessons
from fatal SELinux bug"
Post by Kevin Kofler
Post by Dominick Grift
The SELinux experience can, in my view be improved, and i believe your
problem is not with SELinux itself but with how it is
configured/implemented by default.

 nor what it can or cannot do. (No amount of configuration can make SELinux
do anything other than block (i.e. break) things.)
Actually it is the other way around. SELinux blocks everything by
default. Everything needs to be explicitly allowed by means of
"configuration"
Post by Kevin Kofler
Post by Dominick Grift
I just believe that a little team coordination, and a little more care
can go a long way, and that that is likely more efficient than trying to
create tests that would catch all of the bugs which sounds like utopia
to me.
What "coordination"?
For example coordinate who does what where and when.
Post by Kevin Kofler
Post by Dominick Grift
I am not saying that the tests can't be improved or that they should not
be improved. It's just that in this case a little bit more care and a
double check by another involved party would probably have prevented this,
and similar other issues, in my view.
Sure, dropping autokarma could help (and should be done in any case, that
Bodhi "feature" never made any sense), but ultimately there's no way around
disabling SELinux. Enabling it by default in Fedora has always been a
mistake!
Kevin Kofler
Michael Schwendt
2014-01-24 20:17:03 UTC
Permalink
Post by Adam Williamson
It's not at all
obvious to anyone that you ought to test update/install of another
package in order to validate an update to selinux-policy-targeted .
Hell, I don't do that.
Amazing.

https://admin.fedoraproject.org/updates/FEDORA-2014-0806/selinux-policy-3.12.1-116.fc20

| selinux-policy-3.12.1-116.fc20 critical path bugfix update


https://fedoraproject.org/wiki/Critical_path_package#Actions

| Packages within the critical path are required to perform the
| most fundamental actions on a system. Those actions include:
|
| [...]
| get updates
| [...]


How to understand that?

Especially with regard to downloading builds from koji, installing them
manually and voting +1 even before a test update has entered the repo.

The fast people, who do that regularly in addition to a daily yum update,
could not escape from this bug. On the contrary, other users who don't
update often, have skipped the bad selinux policy update.

I consider it likely that the testers would have noticed Yum/RPM update
errors, if only they had used their updated systems normally for let's say
one or two days and at least one reboot.

There's also

fedora-easy-karma --installed-min-days=4

which can be very helpful, since you won't be asked for updates installed
just a few minutes ago.
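
The idea behind a minimum-days filter can be sketched as follows. This is only
an illustration of the concept, not how fedora-easy-karma is implemented. It
assumes yum-log-style lines of the form "Jan 22 12:48:00 Updated: <package>",
an English month-name locale, and a caller-supplied year (such lines carry
none). The function name installed_at_least is hypothetical.

```python
from datetime import datetime, timedelta

def installed_at_least(log_lines, now, min_days, year):
    """Return packages whose "Updated:" entry is at least min_days older than now.

    The log lines carry no year, so the caller supplies it (assumption).
    """
    ready = []
    for line in log_lines:
        # Expected shape: "Jan 22 12:48:00 Updated: yum-3.4.3-132.fc20.noarch"
        stamp, _, package = line.partition(" Updated: ")
        if not package:
            continue  # not an "Updated:" entry, skip it
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        if now - when >= timedelta(days=min_days):
            ready.append(package.strip())
    return ready

log = [
    "Jan 22 12:48:00 Updated: yum-3.4.3-132.fc20.noarch",
    "Jan 24 17:41:58 Updated: pango-1.36.1-2.fc20.x86_64",
]
result = installed_at_least(log, datetime(2014, 1, 26, 13, 0, 0), min_days=4, year=2014)
print(result)  # prints: ['yum-3.4.3-132.fc20.noarch']
```

Only the package that has been installed for four full days is offered for a
vote; the one updated two days ago is held back.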

Also let's not forget, for testing an selinux-policy-targeted update,
you ought to run with SELinux in enforcing mode.
Post by Adam Williamson
The 'comment' field exists to allow people to express all these things,
but as it's just a completely free-form text field,
... and even can be left empty :( so a packager doesn't get any
explicit feedback from the tester other than the +1.
Post by Adam Williamson
Those are just examples: the point is that what we badly need here is a
more expressive and flexible system. (As well, as I've said elsewhere in
the discussion, as a good automated test for this specific and
well-known category of 'delayed action' update problems).
Is it so hard for testers to slow down a bit until such a system will be
available? ;-)
Adam Williamson
2014-01-24 20:39:55 UTC
Permalink
Post by Michael Schwendt
Post by Adam Williamson
It's not at all
obvious to anyone that you ought to test update/install of another
package in order to validate an update to selinux-policy-targeted .
Hell, I don't do that.
Amazing.
https://admin.fedoraproject.org/updates/FEDORA-2014-0806/selinux-policy-3.12.1-116.fc20
| selinux-policy-3.12.1-116.fc20 critical path bugfix update
https://fedoraproject.org/wiki/Critical_path_package#Actions
| Packages within the critical path are required to perform the
|
| [...]
| get updates
| [...]
How to understand that?
It also says:

graphical network install
post-install booting
decrypt encrypted filesystems
graphics
login
networking
get updates
minimal buildroot
compose new trees
compose live

Are we to be expected to re-test every single one of those actions with
every single critical path update? That seems unreasonable. If that were
the approach, I think Kevin would have an actual apoplexy while waiting
for his updates to get released. :)
Post by Michael Schwendt
Especially with regard to downloading builds from koji, installing them
manually and voting +1 even before a test update has entered the repo.
The fast people, who do that regularly in addition to a daily yum update,
could not escape from this bug. On the contrary, other users who don't
update often, have skipped the bad selinux policy update.
I consider it likely that the testers would have noticed Yum/RPM update
errors, if only they had used their updated systems normally for let's say
one or two days and at least one reboot.
There's also
fedora-easy-karma --installed-min-days=4
which can be very helpful, since you won't be asked for updates installed
just a few minutes ago.
Yup, indeed. Of course, this is another area where we could improve the
tooling: it doesn't seem like it'd be difficult for maintainers to be
allowed to set a minimum timeframe before their update goes stable, but
at present this isn't possible.
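
A rough sketch of what such a rule could look like, purely as an illustration:
the class Update, the function may_autopush and its parameters are hypothetical
and do not describe anything Bodhi currently offers.

```python
from dataclasses import dataclass

@dataclass
class Update:
    karma: int              # net karma collected so far
    days_in_testing: float  # time since the update entered updates-testing

def may_autopush(update, stable_karma, min_days):
    """Hypothetical rule: require enough karma AND enough time in testing."""
    return update.karma >= stable_karma and update.days_in_testing >= min_days

# An update that collected +3 within half a day.
quick = Update(karma=3, days_in_testing=0.5)
print(may_autopush(quick, stable_karma=3, min_days=0))  # prints: True  (karma-only rule)
print(may_autopush(quick, stable_karma=3, min_days=2))  # prints: False (minimum time set)
```

With min_days=0 this degenerates to the current karma-only behaviour, so a
maintainer-set minimum would be a strict addition rather than a replacement.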
Post by Michael Schwendt
Also let's not forget, for testing an selinux-policy-targeted update,
you ought to run with SELinux in enforcing mode.
This is already generally understood, I think.
Post by Michael Schwendt
Post by Adam Williamson
The 'comment' field exists to allow people to express all these things,
but as it's just a completely free-form text field,
... and even can be left empty :( so a packager doesn't get any
explicit feedback from the tester other than the +1.
Post by Adam Williamson
Those are just examples: the point is that what we badly need here is a
more expressive and flexible system. (As well, as I've said elsewhere in
the discussion, as a good automated test for this specific and
well-known category of 'delayed action' update problems).
Is it so hard for testers to slow down a bit until such a system will be
available? ;-)
Well, at present, it kind of is, because we have this Bodhi-Bugzilla
interaction where when an update is submitted that claims to fix a given
bug, Bodhi posts a comment on the bug that says 'try this update and
leave positive karma if it fixes your bug'.

I mean, we could tweak that process, but I still think the Grand Bodhi
2.0 Vision is much more interesting, because if we have multiple karma
types we can still take and use fast 'this fixed my bug' feedback
however we feel it to be appropriate, while having _much_ more
flexibility to do 'valuation' of feedback on a 'type of feedback' and
'package being updated' basis. Again, hate to sound like a broken
record, but it's just hard to get enthusiastic about trying to twiddle
the edges of the process when the process is fundamentally inadequate.
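
As a very rough sketch of what per-type, per-package valuation might mean: the
feedback type names, the weights and the function weighted_karma below are all
invented for illustration and do not come from any Bodhi design document.

```python
# Hypothetical feedback types and weights (illustration only).
DEFAULT_WEIGHTS = {"fixed_my_bug": 1.0, "general_use": 1.0}

# Hypothetical per-package override: for a policy package that bundles many
# changes, a "fixed my bug" vote says little about the rest of the update.
PACKAGE_WEIGHTS = {
    "selinux-policy": {"fixed_my_bug": 0.25, "general_use": 1.0},
}

def weighted_karma(package, feedback):
    """Sum (feedback_type, vote) pairs, scaling each vote by its type weight."""
    weights = PACKAGE_WEIGHTS.get(package, DEFAULT_WEIGHTS)
    return sum(weights.get(kind, 0.0) * vote for kind, vote in feedback)

three_quick_votes = [("fixed_my_bug", 1)] * 3
print(weighted_karma("selinux-policy", three_quick_votes))  # prints: 0.75
print(weighted_karma("kfloppy", three_quick_votes))         # prints: 3.0
```

The same three votes would clear a threshold of 3 for an ordinary package but
stay well below it for the policy package, while still being recorded.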
--
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net
Michael Schwendt
2014-01-24 21:36:36 UTC
Permalink
Post by Adam Williamson
Post by Michael Schwendt
https://fedoraproject.org/wiki/Critical_path_package#Actions
| Packages within the critical path are required to perform the
|
| [...]
| get updates
| [...]
How to understand that?
graphical network install
post-install booting
decrypt encrypted filesystems
graphics
login
networking
get updates
minimal buildroot
compose new trees
compose live
Are we to be expected to re-test every single one of those actions with
every single critical path update? That seems unreasonable. If that were
the approach, I think Kevin would have an actual apoplexy while waiting
for his updates to get released. :)
It would be unreasonable for a single tester, but "the more testers, the
better".

Which doesn't mean the group of testers will perform _all_ of the
(re-)tests. It means: the more testers get a chance to apply Test Updates
*and* continue using their systems normally for a few days, the more
likely it could get that some will notice issues at run-time, boot-time,
compile-time, to mention a few.

That's why I think there's reason to be very careful and sometimes even
prefer a +0 (with a comment) over a very early over-ambitious +1.

And guess what happens in non-critpath updates after 7 days and _no_
feedback. Packagers push the update manually. Sometimes with broken
deps. Sometimes the testing starts no sooner than when the update arrives
in the stable updates repo and the first real user becomes the "guinea
pig".
Post by Adam Williamson
Post by Michael Schwendt
Is it so hard for testers to slow down a bit until such a system will be
available? ;-)
Well, at present, it kind of is, because we have this Bodhi-Bugzilla
interaction where when an update is submitted that claims to fix a given
bug, Bodhi posts a comment on the bug that says 'try this update and
leave positive karma if it fixes your bug'.
Good point. Raises the question why an update that links so many bugzilla
tickets can be marked stable automatically after a +3, which may be even
about a single bz ticket.
Post by Adam Williamson
I mean, we could tweak that process, but I still think the Grand Bodhi
2.0 Vision is much more interesting, because if we have multiple karma
types we can still take and use fast 'this fixed my bug' feedback
however we feel it to be appropriate, while having _much_ more
flexibility to do 'valuation' of feedback on a 'type of feedback' and
'package being updated' basis. Again, hate to sound like a broken
record, but it's just hard to get enthusiastic about trying to twiddle
the edges of the process when the process is fundamentally inadequate.
That's understood, of course.
Stephen John Smoogen
2014-01-24 22:35:24 UTC
Permalink
Post by Michael Schwendt
Post by Adam Williamson
Post by Michael Schwendt
https://fedoraproject.org/wiki/Critical_path_package#Actions
| Packages within the critical path are required to perform the
|
| [...]
| get updates
| [...]
How to understand that?
graphical network install
post-install booting
decrypt encrypted filesystems
graphics
login
networking
get updates
minimal buildroot
compose new trees
compose live
Are we to be expected to re-test every single one of those actions with
every single critical path update? That seems unreasonable. If that were
the approach, I think Kevin would have an actual apoplexy while waiting
for his updates to get released. :)
It would be unreasonable for a single tester, but "the more testers, the
better".
Looking at the number of people who respond to the qa list at times.. I am
going to say there are probably 6-10 active testers during non-release
times. It comes and it goes, but that is about the number who seem active
at least on lists (and it seems to be that way going over archives for the
last couple of years.) So any policy would need to take that limitation
into account.
--
Stephen J Smoogen.
Michael Schwendt
2014-01-24 22:46:41 UTC
Permalink
Post by Stephen John Smoogen
Looking at the number of people who respond to the qa list at times.. I am
going to say there are probably 6-10 active testers during non-release
times. It comes and it goes, but that is about the number who seem active
at least on lists (and it seems to be that way going over archives for the
last couple of years.) So any policy would need to take that limitation
into account.
Any ideas how to attract more testers?

How to make the updates-testing repo more sexy?


More lessons to learn:

https://admin.fedoraproject.org/updates/FEDORA-2013-23627
Karma: 17
Stable karma: 16 (!)

It has reached the karma threshold 16 after ~5 days.
And those have not been all testers.
Reindl Harald
2014-01-24 23:21:13 UTC
Permalink
Post by Michael Schwendt
Post by Stephen John Smoogen
Looking at the number of people who respond to the qa list at times.. I am
going to say there are probably 6-10 active testers during non-release
times. It comes and it goes, but that is about the number who seem active
at least on lists (and it seems to be that way going over archives for the
last couple of years.) So any policy would need to take that limitation
into account.
Any ideas how to attract more testers?
How to make the updates-testing repo more sexy?
https://admin.fedoraproject.org/updates/FEDORA-2013-23627
Karma: 17
Stable karma: 16 (!)
It has reached the karma threshold 16 after ~5 days.
And those have not been all testers
but where do you see a problem with that package?
hreindl - 2013-12-20 11:07:28
works for me

[root@srv-rhsoft:~]$ rpm -q yum
yum-3.4.3-132.fc20.noarch

Jan 22 12:48:00 Updated: yum-3.4.3-132.fc20.noarch
________________________________________

built explicitly to test the yum package:
Jan 22 12:52:25 Updated: lounge-rhsoft-workstation-20.0-2.fc20.20140122.rh.noarch

updates after that:
Jan 22 14:09:10 Updated: ffmpeg-latest-manpages-2.1.3-4.fc20.20140122.rh.noarch
Jan 22 14:09:11 Updated: ffmpeg-latest-2.1.3-4.fc20.20140122.rh.x86_64
Jan 22 14:28:03 Installed: iftop-1.0-0.7.pre4.fc20.x86_64
Jan 22 14:28:03 Updated: lounge-rhsoft-workstation-20.0-3.fc20.20140122.rh.noarch
Jan 23 11:29:10 Updated: apcupsd-3.14.10-14.fc20.x86_64
Jan 23 11:29:10 Updated: apcupsd-gui-3.14.10-14.fc20.x86_64
Jan 23 16:32:25 Updated: glibc-2.18-12.fc20.x86_64
Jan 23 16:32:30 Updated: glibc-common-2.18-12.fc20.x86_64
Jan 23 16:32:30 Updated: pulseaudio-libs-4.0-12.gitf81e3.fc20.x86_64
Jan 23 16:32:30 Updated: libtool-ltdl-2.4.2-23.fc20.x86_64
Jan 23 16:32:31 Updated: pulseaudio-4.0-12.gitf81e3.fc20.x86_64
Jan 23 16:32:31 Updated: pulseaudio-utils-4.0-12.gitf81e3.fc20.x86_64
Jan 23 16:32:32 Updated: glibc-headers-2.18-12.fc20.x86_64
Jan 23 16:32:33 Updated: glibc-devel-2.18-12.fc20.x86_64
Jan 23 16:32:33 Updated: pulseaudio-module-x11-4.0-12.gitf81e3.fc20.x86_64
Jan 23 16:32:33 Updated: pulseaudio-libs-glib2-4.0-12.gitf81e3.fc20.x86_64
Jan 23 16:38:22 Updated: 12:dhcp-libs-4.2.6-0.1.b1.fc20.x86_64
Jan 23 16:38:22 Updated: 12:dhcp-common-4.2.6-0.1.b1.fc20.x86_64
Jan 23 16:38:23 Updated: 12:dhcp-4.2.6-0.1.b1.fc20.x86_64
Jan 23 16:38:23 Updated: 12:dhclient-4.2.6-0.1.b1.fc20.x86_64
Jan 23 21:03:57 Updated: tzdata-java-2013i-2.fc20.noarch
Jan 23 21:03:57 Updated: crypto-utils-2.4.1-46.fc20.x86_64
Jan 23 21:03:58 Updated: krb5-libs-1.11.3-39.fc20.x86_64
Jan 23 21:03:59 Updated: google-crosextra-caladea-fonts-1.002-0.3.20130214.fc20.noarch
Jan 23 21:04:01 Updated: webkitgtk-2.2.4-1.fc20.x86_64
Jan 23 21:04:02 Updated: tzdata-2013i-2.fc20.noarch
Jan 24 17:41:58 Updated: fltk-1.3.2-3.fc20.x86_64
Jan 24 17:41:58 Updated: libnl3-3.2.24-1.fc20.x86_64
Jan 24 17:41:58 Updated: freeglut-2.8.1-3.fc20.x86_64
Jan 24 17:41:58 Updated: pango-1.36.1-2.fc20.x86_64

Michael Schwendt
2014-01-25 00:22:38 UTC
Permalink
Post by Reindl Harald
Post by Michael Schwendt
https://admin.fedoraproject.org/updates/FEDORA-2013-23627
Karma: 17
Stable karma: 16 (!)
It has reached the karma threshold 16 after ~5 days.
And those have not been all testers
but where do you see a problem with that package?
With the bodhi ticket? I don't. :)
I've pointed at this because it's another good example of an
increased karma threshold.

Of course, and at the risk of repeating it too often, the better
example (due to the early votes) is this:
https://admin.fedoraproject.org/updates/FEDORA-2013-22706/yum-3.4.3-119.fc20
Reindl Harald
2014-01-24 23:34:23 UTC
Permalink
Post by Michael Schwendt
Post by Stephen John Smoogen
Looking at the number of people who respond to the qa list at times.. I am
going to say there are probably 6-10 active testers during non-release
times. It comes and it goes, but that is about the number who seem active
at least on lists (and it seems to be that way going over archives for the
last couple of years.) So any policy would need to take that limitation
into account.
Any ideas how to attract more testers?
How to make the updates-testing repo more sexy?
* i am running updates testing 365/24 over the last 3 years
* i am running "yum --enablerepo=updates-testing --security" in production
* i test these packages in "near production" (test-vm-mirrors)

the only ones i leave out are *real* server packages because i
generally build them on my own, including major updates that Fedora
does not see at all, or the other direction (PHP 5.4 first time with F18)
Post by Michael Schwendt
More lessons to learn
yes here: https://bugzilla.redhat.com/show_bug.cgi?id=1019251

Joe Orton 2013-11-18 07:15:12 EST
Upstream is gearing up for 2.4.7 RSN - ECC support will get
picked up automagically when we do a new f19 build

where is the 2.4.7 build for F19?

more than 3 months ago i had httpd-2.4.6 with ECDHE *in production* on F18
and so confirmed that httpd works fine as expected with the new openssl
https://bugzilla.redhat.com/show_bug.cgi?id=319901#c108

Kevin Kofler
2014-01-25 18:29:12 UTC
Permalink
Post by Michael Schwendt
https://admin.fedoraproject.org/updates/FEDORA-2013-23627
Karma: 17
Stable karma: 16 (!)
It has reached the karma threshold 16 after ~5 days.
And those have not been all testers.
That can work for yum, but if I set the stable karma to 16 for kfloppy, the
release will reach its EOL without it getting there (unless somebody targets
testers specifically at kfloppy to win the bet ;-) and buys them floppy
drives, too ;-) ). Heck, even if Fedora releases didn't have an EOL, you or
me probably wouldn't live to see it go stable.

Kevin Kofler
Michael Schwendt
2014-01-25 19:28:52 UTC
Permalink
Post by Kevin Kofler
Post by Michael Schwendt
https://admin.fedoraproject.org/updates/FEDORA-2013-23627
Karma: 17
Stable karma: 16 (!)
It has reached the karma threshold 16 after ~5 days.
And those have not been all testers.
That can work for yum, but if I set the stable karma to 16 for kfloppy,
Nobody suggested doing that for kfloppy or other packages, which hardly
ever get feedback in bodhi.

If a test update doesn't get any feedback in bodhi, what does that imply?

If you mark it stable after 7 days, you've tested it yourself for some
days, correct?

If the update doesn't refer to any bugzilla tickets, what does that mean?
Post by Kevin Kofler
the
release will reach its EOL without it getting there (unless somebody targets
testers specifically at kfloppy to win the bet ;-) and buys them floppy
drives, too ;-) ). Heck, even if Fedora releases didn't have an EOL, you or
me probably wouldn't live to see it go stable.
Almost funny, if it weren't possible to mark test updates as stable after
7 days.

It could be that nobody uses the package at all, so it would not be a big
deal if an update (or upgrade?) took 7+ days to enter the updates repo. ;-p
Kevin Kofler
2014-01-25 21:00:02 UTC
Permalink
Post by Michael Schwendt
If the update doesn't refer to any bugzilla tickets, what does that mean?
In that particular case, it means that we are updating all the KDE software
compilation and so there's a new release of KFloppy too, which most likely
doesn't even contain any actual changes from upstream (just a new version
number on the tarball), but the updates are scripted, and the version bump
is also needed to keep our metapackages (kdeutils in this case) working. :-)

That said, in practice, we file those as grouped updates and so there's a
chance that the update actually gets some karma. Surely not because of
KFloppy though. ;-)
Post by Michael Schwendt
Almost funny, if it weren't possible to mark test updates as stable after
7 days.
Right, but you were proposing to wait until it reaches a karma of +16.
Post by Michael Schwendt
It could be that nobody uses the package at all, so it would not be a big
deal if an update (or upgrade?) took 7+ days to enter the updates repo. ;-p
But then the right solution is to disable karma automatism entirely, not to
set it to some ridiculously high value.

Those meaningless thresholds need to go away (and really, the whole concept
of Bodhi karma and the policies that depend on it).

Kevin Kofler
Reindl Harald
2014-01-25 21:11:08 UTC
Permalink
Post by Kevin Kofler
But then the right solution is to disable karma automatism entirely, not to
set it to some ridiculously high value.
Those meaningless thresholds need to go away (and really, the whole concept
of Bodhi karma and the policies that depend on it)
i am not entirely sure how that is meant

* disable the automatism to push to stable
* forget the whole karma system at all

in case of "disable the automatism to push to stable" i agree

in my opinion karma is an indication for the maintainer but not
the decision - the karma has to be handled differently for the
same package and different updates and only the maintainer can
decide that *as a person*

why?

because it depends on the change itself

speaking with my developer hat on: there are updates to software
inside our company where i do not hesitate a single second to deploy
the new CMS version to some hundreds of customers without telling anybody
there was an update at all because *i know* there can be no bad impact

on the other hand there are updates and changes which need every
single webhost prepared: roll out a small update to prepare the real one by
adding database columns not used currently but which need to be there in the
time window when files are replaced and the database schema can be updated

the second case is for not having any single request go wrong

and there is another category where all the work above has to be done
and tested thousands of times but still needs a "keep your eyes open"
after it is done because you can't test and verify every single action
a complex software may do with every possible input data



Michael Schwendt
2014-01-25 21:25:59 UTC
Permalink
Post by Kevin Kofler
Right, but you were proposing to wait until it reaches a karma of +16.
Certainly not. That Yum update is only a good example where a high
karma threshold has been reached in less than a week, and even without
a vote from all available/active testers. Sure, Yum is widely-used, which
cannot be said about niche-market packages.

I'm proposing that Test Updates are offered for a minimum duration
to allow for the time it takes to push them into the repo *and*
be picked up by world-wide mirrors. That increases the chance that
available testers get a chance of evaluating an update and giving
feedback _before_ it gets marked stable because of reaching +3 very
early (based on koji downloads, for example).
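
The proposal above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Bodhi is implemented: the function name may_autopush, its parameters, and the three-day default are all hypothetical. The point it shows is that reaching the karma threshold alone would not trigger the push; the update would also need to have been in updates-testing for a minimum time.

```python
from datetime import datetime, timedelta


def may_autopush(karma, stable_karma, entered_testing, now, min_days=3):
    """Return True if an update may be pushed to stable automatically.

    Hypothetical rule: the karma threshold alone is not enough; the update
    must also have spent at least `min_days` in updates-testing.
    """
    waited_long_enough = now - entered_testing >= timedelta(days=min_days)
    return karma >= stable_karma and waited_long_enough


# Example: +3 is reached six hours after the update entered testing.
entered = datetime(2014, 1, 20, 12, 0)
print(may_autopush(3, 3, entered, datetime(2014, 1, 20, 18, 0)))  # False
print(may_autopush(3, 3, entered, datetime(2014, 1, 24, 12, 0)))  # True
```

With such a rule, an early +3 based on koji downloads would simply wait until mirrors and testers have had time to catch up.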

That has worked rather well for the more interesting update with a karma
threshold of 5: https://admin.fedoraproject.org/updates/FEDORA-2013-22706

The -1 votes have not been too late.

On the contrary, if a tester is available and finds a bug in a new
update as offered on a nearby mirror, but in bodhi the update has been
marked stable in less than a day already or has skipped updates-testing
even, the tester cannot help anymore. That's a case that would be
avoidable.
Post by Kevin Kofler
Post by Michael Schwendt
It could be that nobody uses the package at all, so it would not be a big
deal if an update (or upgrade?) took 7+ days to enter the updates repo. ;-p
But then the right solution is to disable karma automatism entirely, not to
set it to some ridiculously high value.
Yes, debatable. But we shouldn't argue about it. '5' or '10' for Yum
updates isn't ridiculously high while still giving testers the opportunity
to vote cleverly and trigger the automatic push.
Adam Williamson
2014-01-24 23:32:24 UTC
Permalink
Post by Stephen John Smoogen
Looking at the number of people who respond to the qa list at times..
I am going to say there are probably 6-10 active testers during
non-release times. It comes and it goes, but that is about the number
who seem active at least on lists (and it seems to be that way going
over archives for the last couple of years.) So any policy would need
to take into account of that limitation.
QA list posters aren't a good proxy for feedback submitters, but you can
get fairly good numbers on active update feedback submitters from Bodhi
itself. Mike Ruckman has taken over the role of posting the 'heroes of
Fedora' numbers each quarter and Fedora release, which include these
statistics:

http://roshi.fedorapeople.org/heroes-of-fedora-quarter-3-2013-statistics.html
--
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net
Adam Williamson
2014-01-24 23:24:53 UTC
Permalink
Post by Michael Schwendt
That's why I think there's reason to be very careful and sometimes even
prefer a +0 (with a comment) over a very early over-ambitious +1.
And guess what happens in non-critpath updates after 7 days and _no_
feedback. Packagers push the update manually. Sometimes with broken
deps. Sometimes the testing starts no sooner than when the update arrives
in the stable updates repo and the first real user becomes the "guinea
pig".
Good point. Raises the question why an update that links so many bugzilla
tickets can be marked stable automatically after a +3, which may be even
about a single bz ticket.
See, this is what happens when we have a fundamentally inadequate
process: an eternal tug-of-war between the tendency to prioritize
'safety' (in a very dumb and insufficiently granular way) and the
tendency to prioritize 'getting updates out' (in a very dumb and
insufficiently granular way). There are reasonable arguments in favour
of both sides.

I incline to the view that any time there is a situation like this -
where there are two alternative ways of doing something, both bad in
different ways, and roughly equally strong arguments on either side -
it's not a great use of anyone's time to keep tweaking things to one end
of the continuum or the other; _we need a better process_.
--
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net
Adam Williamson
2014-01-24 23:29:39 UTC
Permalink
Post by Michael Schwendt
Good point. Raises the question why an update that links so many bugzilla
tickets can be marked stable automatically after a +3, which may be even
about a single bz ticket.
Because that's how the maintainer configured it. (It is also the default
configuration, of course, but no-one is obliged to accept the default
configuration).

In Fedora we effectively define a *baseline* requirement for updates of
various types to go from testing to stable, and beyond that, we grant to
maintainers the power to choose the policy that will apply to their
updates (and, therefore, the responsibility to choose wisely).
--
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net
Kevin Kofler
2014-01-25 18:22:45 UTC
Permalink
Post by Adam Williamson
Yup, indeed. Of course, this is another area where we could improve the
tooling: it doesn't seem like it'd be difficult for maintainers to be
allowed to set a minimum timeframe before their update goes stable, but
at present this isn't possible.
Why do we need to keep polishing that karma turd instead of just flushing it
away? Especially "karma automatism" is totally broken by design.
Post by Adam Williamson
Again, hate to sound like a broken record, but it's just hard to get
enthusiastic about trying to twiddle the edges of the process when the
process is fundamentally inadequate.
Oh yes, it is! So let's do away with it!

Kevin Kofler
Richard W.M. Jones
2014-01-25 10:43:22 UTC
Permalink
Post by Adam Williamson
Post by Michael Schwendt
Post by Kevin Kofler
* That update made it out to the stable updates! In other words, the
draconian Update Policies that were enacted in a vain attempt to prevent
such issues from happening utterly failed at catching this bug.
Those policies are not "draconian" enough [1]. On the erroneous belief that
a +1 from three different testers would mean that the update has seen
enough testing, the test update has been published with the default karma
threshold of +3. The testers have failed. It's too simple for testers to
rush through the voting in bodhi without testing the updates
painstakingly. "The faster the better" has led to a fatal mistake in
this case.
I think that's being unnecessarily harsh on the testers. It's not at all
obvious to anyone that you ought to test update/install of another
package in order to validate an update to selinux-policy-targeted .
Hell, I don't do that.
Doesn't / can't AutoQA (or whatever we're calling it these days) pick
up the new package, install it in a VM, and run through some automated
tests:

- Does Fedora still boot with this package added?
- Does GNOME still come up?
- Does yum still work?

At least the third one might have automatically found this bug.
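
A rough sketch of what such a check runner could look like, in Python. Everything here is an assumption for illustration: run_smoke_checks and the check list are hypothetical, this is not AutoQA code, and a real setup would run the commands inside a freshly updated VM. It only shows the idea that each check is a command whose non-zero exit status (such as the 127 from the bug report) marks the update as suspect.

```python
import subprocess
import sys


def run_smoke_checks(checks):
    """Run each (name, argv) check; return the names of the checks that failed.

    A check fails if its command exits non-zero or cannot be started at all.
    """
    failed = []
    for name, argv in checks:
        try:
            result = subprocess.run(
                argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
            )
            ok = result.returncode == 0
        except OSError:
            ok = False  # command missing or not executable
        if not ok:
            failed.append(name)
    return failed


# Stand-in checks using the Python interpreter itself; a real list would
# hold commands that exercise booting, the desktop, and the package manager.
EXAMPLE_CHECKS = [
    ("exits cleanly", [sys.executable, "-c", "raise SystemExit(0)"]),
    ("exits with 127", [sys.executable, "-c", "raise SystemExit(127)"]),
]
```

Calling run_smoke_checks(EXAMPLE_CHECKS) reports only the second check as failed.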

Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine. Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
Adam Williamson
2014-01-25 16:37:03 UTC
Permalink
Post by Richard W.M. Jones
Post by Adam Williamson
I think that's being unnecessarily harsh on the testers. It's not at all
obvious to anyone that you ought to test update/install of another
package in order to validate an update to selinux-policy-targeted .
Hell, I don't do that.
Doesn't
No, it doesn't.
Post by Richard W.M. Jones
/ can't AutoQA (or whatever we're calling it these days) pick
up the new package, install it in a VM, and run through some automated
tests:
- Does Fedora still boot with this package added?
- Does GNOME still come up?
- Does yum still work?
I answered precisely that question a couple of days ago in the other
thread:

https://lists.fedoraproject.org/pipermail/devel/2014-January/194155.html
--
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net
Kevin Kofler
2014-01-25 18:03:00 UTC
Permalink
Post by Adam Williamson
The 'comment' field exists to allow people to express all these things,
but as it's just a completely free-form text field, it's intrinsically
impossible to really base any programmatic stuff or even policy on it.
In theory maintainers could submit updates without using autokarma and
then keep a careful eye on the feedback and 'tend' their updates
manually, but I think it's pretty clear that in practice, this is not
what happens: maintainers really want to be able to use the karma system
as a 'helper', they want to farm out the evaluation process to Bodhi/the
karma system. But our current system is too stupid to handle this
perfectly, so we get these breakdowns.
It's not that they WANT to be able to do that, it's that the system is
rigged to encourage that broken practice. Autokarma (ab)use was much lower
before the enactment of the Update Policies. And really, autokarma needs to
just go away entirely. Having an intelligent being interpret the free-form
text field is the only way to make sane decisions (which also implies that
we should not arbitrarily impose any restrictions that, by their nature,
cannot take the free-form text into account).
Post by Adam Williamson
With a more flexible karma system we have a *lot* of opportunity to do
much cleverer stuff. We can provide presets for all the above different
things that are currently commonly expressed via +1 or -1 with a
comment. This opens up possibilities at two different levels: the distro
policy level, and the packager level. We can make the distro policy much
more fine-grained, if we want to - we can require certain of the 'karma
types' to be available in all updates, and for instance, block any
update where X people pull the 'it's completely busted' or 'it
introduces a security vulnerability' cord, regardless of how much
broadly-categorized 'positive' karma it has. At the packager level, the
packager gets the freedom to define a much more fine-grained policy for
when they're happy that updates to their package are 'good to go', but
they still don't have to sit there reading the emails and manually
interpreting what people have written. You get to define the policy that
makes the most sense for your package, within the confines of the
distro-wide policy - if you have a good package-specific test suite, you
can say to the auto-karma system 'don't send this update out until at
least one person sets the "I ran the test suite and it passed" karma
property'.
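
For readers trying to picture the quoted proposal, here is a minimal Python sketch of a typed-feedback policy. It is an assumption-laden illustration only: the feedback kinds, the function may_autopush, and its parameters are hypothetical names and do not come from Bodhi. It models a distro-wide veto on "blocking" feedback plus a packager-defined list of feedback kinds that must appear at least once.

```python
from collections import Counter

# Hypothetical feedback kinds; none of these names come from Bodhi.
POSITIVE = {"works_for_me", "fixes_bug", "test_suite_passed"}
NEGATIVE = {"regression", "does_not_fix_bug"}
BLOCKING = {"completely_busted", "security_vulnerability"}


def may_autopush(feedback, stable_karma=3, block_at=1, required=()):
    """Decide whether typed feedback allows an automatic push to stable.

    feedback: iterable of feedback kinds, one entry per tester comment.
    block_at: count (>= 1) of any one blocking kind that vetoes the push.
    required: kinds the packager insists on seeing at least once.
    """
    counts = Counter(feedback)
    if any(counts[kind] >= block_at for kind in BLOCKING):
        return False  # distro-level veto, regardless of positive karma
    if any(counts[kind] == 0 for kind in required):
        return False  # packager's own condition is not met yet
    karma = sum(counts[k] for k in POSITIVE) - sum(counts[k] for k in NEGATIVE)
    return karma >= stable_karma
```

For example, five "works_for_me" votes plus a single "completely_busted" would still hold the update back under this sketch.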
To me, this just screams "OVERENGINEERED!!!". :-(

You are introducing a lot of complexity, that will ultimately always only be
an approximation to reality. You just cannot reliably quantify all the
details. E.g. "this introduces a regression, but the regression is that you
sometimes have to click OK twice (instead of once) to format a 5.25" floppy
in KFloppy, whereas it fixes a critical bug in KDE Plasma Desktop where all
your data was sent to the NSA and then securely wiped from your hard disk".
:-) (Yes, of course that is an exaggerated example. ;-) I sure hope we don't
have bugs like that. ;-) ) If you only have "fixes a bug, but introduces a
regression" as a feedback type, you probably end up making the wrong
decision. If you try to get more fine-grained, then you again need numbers
to quantify the severity of one issue vs. the other, and those will
inherently be subjective. (Users always think THEIR bug is the end of the
world whereas regressions that don't affect them are entirely unimportant.)

The complexity also means there are a lot more arbitrary parameters to deal
with. The current stable/unstable thresholds are already bad enough, and
often end up set to the wrong value. A decision process tends to get
worse the more arbitrary parameters it needs.

And of course, more complexity means less transparency. It becomes harder
and harder to understand what really needs to be satisfied for an update to
be allowed to go stable.

So I can only advocate for the KISS approach: The update is stable when the
maintainer says so, period. We do not need any karma, be it a simple ±1 or a
long (and inherently non-exhaustive) list of all the things that can
possibly happen. So let's just do away with the 3 radio buttons and use a
free-form text field only. We just need somebody able to read comments, and
that is what we have maintainers for!

Kevin Kofler