Jonathan Dieter
2018-11-16 22:07:09 UTC
For reference, this is in reply to Paul's email about lifecycle
objectives, specifically focusing on problem statement #1[1].
<tl;dr>
Have rpm use zchunk as its compression format, removing the need for
deltarpms, and thus reducing compose time. This will require changes
to the rpm format and new features in the zchunk format.
</tl;dr>
*deltarpm background*
As part of the compose process, deltarpms are generated between each
new rpm and both the GA version of the rpm and the previous version.
This process is very CPU and memory intensive, especially for large
rpms.
This also means that deltarpms are only useful for an end user if they
are either updating from GA or have been diligent about keeping their
system up-to-date. If a user is updating a package from N-2 to N,
there will be no deltarpm and the full rpm will be downloaded.
*zchunk background*
As some are aware, I've been working on zchunk[2], a compression format
that's designed for highly efficient deltas, and using it to minimize
metadata downloads[3].
The core idea behind zchunk is that a file is split into independently
compressed chunks and the checksum of each compressed chunk is stored
in the zchunk header. When downloading a new version of the file, you
download the zchunk header first, check which chunks you already have,
and then download the rest.
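
To make the idea concrete, here is a toy sketch of the "which chunks do
I already have" step. It is not the zchunk implementation: the
fixed-size split, zlib, SHA-256, and all function names below are
stand-ins I picked for illustration, and the "header" is just a list of
digests.

```python
import hashlib
import zlib


def make_chunks(data: bytes, size: int = 8) -> list[bytes]:
    """Split data into fixed-size pieces and compress each independently."""
    return [zlib.compress(data[i:i + size]) for i in range(0, len(data), size)]


def make_header(chunks: list[bytes]) -> list[str]:
    """Toy header: one SHA-256 digest per compressed chunk."""
    return [hashlib.sha256(chunk).hexdigest() for chunk in chunks]


def chunks_to_download(new_header: list[str], local_chunks: list[bytes]) -> list[int]:
    """Return the indexes of chunks in the new file that are not held locally."""
    have = set(make_header(local_chunks))
    return [i for i, digest in enumerate(new_header) if digest not in have]


# Old and new versions differ only in the middle 8 bytes.
old = make_chunks(b"aaaaaaaabbbbbbbbcccccccc")
new = make_chunks(b"aaaaaaaaXXXXXXXXcccccccc")

# The client only has the new header plus its own local chunks.
missing = chunks_to_download(make_header(new), old)
print(missing)  # [1]
```

Only the middle chunk has to be fetched; the first and last are reused
from the local copy.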
*Proposal*
My proposal would be to make zchunk the rpm compression format for
Fedora. This would involve a few additions to the zchunk format[4]
(something the format has been designed to accommodate), and would
require some changes to the rpm file format.
*Benefit*
The benefit of zchunked rpms is that, when downloading an updated rpm,
you would only need to download the chunks that have changed from
what's on your system.
The uncompressed local chunks would be combined with the downloaded
compressed chunks to create a local rpm that will pass signature
verification without needing to recompress the uncompressed local
chunks, making this computationally much faster than rebuilding a
deltarpm, a win for users.
The savings wouldn't be as good as what deltarpm can achieve, but
deltarpms would be redundant and could be removed, completely
eliminating a large step from the compose process.
*Drawbacks*
1. Downloading a new release of a zchunked rpm would be larger than
downloading the equivalent deltarpm. This is offset by the fact
that the client is able to work out which chunks it needs no matter
what the original rpm is, rather than needing a specific original
rpm as deltarpm does.
2. The rebuilt rpm may not be byte-for-byte identical to the original,
but it can still be validated without decompression, as explained
in the next section.
*Changes*
The zchunk format would need to be extended to allow for a zchunked rpm
to contain both the uncompressed chunks that were already on the local
system and the newly downloaded compressed chunks while still passing
signature verification. This would also require moving signature
verification to zchunk.
The rpm file format has to be changed because the zchunk header needs
to be at the beginning of the file in order for the zchunk library to
figure out which chunks it needs to download. My suggestions for
changes to the rpm file format are as follows:
1. Signing should be moved to the zchunk format as described at the
beginning of this section.
2. The rpm header should be stored in one stream inside the zchunk
file. This allows it to be easily extracted separately from the
data.
3. The rpm cpio should be stored in a second stream inside the zchunk
file.
4. At minimum, an optional zchunk element should be set to identify
zchunk rpms as rpms rather than regular zchunk files. If desired,
optional elements could also be set containing %{name}, %{version},
%{release}, %{arch} and %{epoch}. This would allow this information
to be read easily without needing to extract the rpm header stream.
*Final notes*
I realize this is a massive proposal, zchunk is still very young, and
we're still working on getting the dnf zchunk pull requests reviewed.
I do think it's feasible and provides an opportunity to eliminate a
pain point from our compose process while still reducing the download
size for our users.
[1]:
https://fedoraproject.org/wiki/Objectives/Lifecycle/Problem_statements#Challenge_.231:_Faster.2C_more_scalable_composes
[2]: https://github.com/zchunk/zchunk
[3]: https://fedoraproject.org/wiki/Changes/Zchunk_Metadata
[4]: https://github.com/zchunk/zchunk/blob/master/zchunk_format.txt
_______________________________________________
devel mailing list -- ***@lists.fedoraproject.org
To unsubscribe send an email to devel-***@lists.fedoraproject.org
Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/***@lists.fedoraproject.o