The Origin of RPM Content
Overview
Pulp supports importing and publishing RPM content, but where does all this
content come from? Who is generating it, and with what tools? This document
answers these questions. Note that it only covers Fedora and Red Hat
Enterprise Linux. It does not cover SLES, openSUSE, Scientific Linux, CentOS,
Oracle Linux, or any other RPM-based distributions (yet).
The RPM
RPM is a packaging format. Many Linux distributions use RPM to distribute their
software packages. An RPM is a collection of metadata about the package as well
as a payload. What the payload contains depends on the type of RPM. There are
three different types of RPM:
- Source code RPMs (referred to as “SRPMs”, “.src.rpm” file extension)
- Binary RPMs (referred to as “RPMs”, “.rpm” file extension)
- Binary delta RPMs (referred to as “DRPMs”, “.drpm” file extension)
What each RPM contains is completely up to the creator of the RPM package. It
could, for example, contain an entire operating system. However, this is very
unwieldy. Many distributions (Fedora, Red Hat, etc.) have strict packaging
guidelines about what should be included. The long version can be found in the
Fedora Wiki Package Guidelines,
but the short version is that each RPM should contain at most a single software
project (for example, the Linux kernel, GCC, or binutils). Generally, if
documentation is large it is separated into its own RPM. If the software
project is compiled, debugging symbols are packaged separately as well. This
means there can be several RPMs for a single project.
SRPM
The payload of a source RPM is simply a compressed tarball of source code and a
file, called a spec file, that describes how to turn the source RPM into a
binary RPM. The spec file includes the installation location and permissions
for all files in the package. This allows the RPM tool to install
the binary RPM to a system with the correct permissions and track which package
“owns” which files. Spec files also allow the author to run shell scripts before
or after an installation, removal, or upgrade of a package. The source RPM can
be built into one or more binary RPMs using rpmbuild.
RPM
A binary RPM’s payload is the collection of files installed from the build
process of the source code. It is architecture- and distribution-specific.
If a package happens to be architecture-independent, it can declare its
architecture as noarch.
DRPM
A binary delta RPM has a payload which contains the binary diff between two
releases of the same package. For example, there could be a binary delta RPM
that can be used to upgrade an existing Firefox 45 installation to a Firefox
46 installation without downloading the entire binary RPM for Firefox 46. This
format exists to save bandwidth for content providers, but it requires a
significant amount of computation on the client attempting to install the DRPM.
Unlike RPMs and SRPMs, DRPMs are not created by rpmbuild. Other tools exist to
build them, like deltarpm, or createrepo_c, which leverages the drpm library.
When using createrepo_c, it is possible to generate both the DRPMs and the
prestodelta.xml metadata required for an RPM repository (covered below).
The Yum/RPM Repository
RPM provides a packaging format, but the RPM (often referred to as yum)
repository provides a way to distribute them. An RPM repository consists of one
or more RPM packages and some metadata describing what RPM packages the
repository contains. The metadata, usually located in a directory called
“repodata” in the root of the repository, is contained in several XML files and,
optionally, several SQLite files. DNF does not make use of the SQLite databases
(which contain the same metadata as the XML), although some clients might. The
filenames of each of these metadata files can be arbitrary, so clients locate
them by using a metadata file that describes the metadata: the repomd.xml file.
To create an RPM repository, all that is required is that the repomd.xml,
primary.xml, filelists.xml, and other.xml metadata files be present. There are
two libraries that can generate them: createrepo and createrepo_c.
createrepo is a Python library that is no longer maintained. createrepo_c
is a C library with Python bindings that is actively maintained.
repomd.xml
repomd.xml is the metadata file that clients use to discover what repository
metadata files exist in the repository. It should always be located at
repodata/repomd.xml relative to the root of the repository. It references the
location of all other metadata files for the repository. This means that the
other metadata files might not be located in the repodata/ directory, but it
is convention to store all RPM repository metadata in repodata/, and all
current Fedora, Red Hat, and CentOS repositories do this.
The repomd.xml file (XML namespace: http://linux.duke.edu/metadata/repo which is
sadly a dead link) contains data elements with one attribute, type. The type
attribute is a string which references the type of metadata file the data
element refers to. Common values are group, filelists, group_gz, primary,
other, filelists_db, primary_db, and other_db. Types ending in _db refer to
SQLite versions of the metadata, while those without _db refer to XML versions.
Each data element contains several other elements describing the metadata and
where it is located: checksum, location, timestamp, and size seem to be always
present, with open-size, open-checksum, and database_version potentially
appearing as well.
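As a rough illustration of how a client might use repomd.xml to discover the other metadata files, here is a minimal Python sketch. The sample document is hypothetical and heavily shortened (checksums and file names are made up), and it assumes that the location element carries its path in an href attribute, as is typical for repositories generated by createrepo_c.

```python
import xml.etree.ElementTree as ET

# Namespace of repomd.xml, as given in the text above.
REPO_NS = {"repo": "http://linux.duke.edu/metadata/repo"}

# Hypothetical, shortened repomd.xml. Checksums and file names are made up.
SAMPLE_REPOMD = """<repomd xmlns="http://linux.duke.edu/metadata/repo">
  <data type="primary">
    <checksum type="sha256">aaaa</checksum>
    <location href="repodata/aaaa-primary.xml.gz"/>
    <timestamp>1460000000</timestamp>
    <size>1234</size>
  </data>
  <data type="filelists">
    <checksum type="sha256">bbbb</checksum>
    <location href="repodata/bbbb-filelists.xml.gz"/>
    <timestamp>1460000000</timestamp>
    <size>5678</size>
  </data>
</repomd>
"""


def metadata_locations(repomd_xml):
    """Map each metadata type to the location recorded for it in repomd.xml."""
    root = ET.fromstring(repomd_xml)
    locations = {}
    for data in root.findall("repo:data", REPO_NS):
        # Assumption: the path is stored in the href attribute of <location>.
        location = data.find("repo:location", REPO_NS)
        locations[data.get("type")] = location.get("href")
    return locations


print(metadata_locations(SAMPLE_REPOMD))
```

A real client would also verify the checksum of each file after downloading it. That step is omitted here to keep the sketch short.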
primary.xml
The primary.xml file (often stored in repodata/<file-checksum>-primary.xml.gz)
contains a list of every RPM and SRPM package in the repository, along with the
network location from which to download each one (DRPMs are covered by
prestodelta.xml below). This includes information like the name, epoch, version,
release, and architecture. It also lists what libraries and binaries the package provides, as
well as what libraries and binaries the package depends upon to work. This
metadata can be used by the client to determine the dependency tree of a
package, how much data it will need to download, and how much space the packages
will take up when installed. Try doing yum install <some uninstalled package>
some time and note how it describes what it’s going to install for dependencies
and how much space it’s going to take up. All that comes from this metadata
file.
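To make this more concrete, the following Python sketch shows how a client could, in principle, work out a dependency set and the associated sizes from primary.xml. The sample data is hypothetical, and the element and attribute names (size, format, rpm:provides, rpm:requires, rpm:entry) are assumptions based on what createrepo_c typically emits. The resolver is deliberately naive: it ignores versions, conflicts, and multiple providers, all of which a real client such as yum or DNF must handle.

```python
import xml.etree.ElementTree as ET

# Assumed namespaces for primary.xml; they are not stated in the text above.
NS = {
    "common": "http://linux.duke.edu/metadata/common",
    "rpm": "http://linux.duke.edu/metadata/rpm",
}

# Hypothetical, shortened primary.xml with made-up packages and sizes.
SAMPLE_PRIMARY = """<metadata xmlns="http://linux.duke.edu/metadata/common"
          xmlns:rpm="http://linux.duke.edu/metadata/rpm" packages="3">
  <package type="rpm">
    <name>example-app</name>
    <arch>x86_64</arch>
    <version epoch="0" ver="1.2" rel="3.fc24"/>
    <size package="2048" installed="8192" archive="8400"/>
    <location href="Packages/e/example-app-1.2-3.fc24.x86_64.rpm"/>
    <format>
      <rpm:provides><rpm:entry name="example-app"/></rpm:provides>
      <rpm:requires><rpm:entry name="libexample.so.1()(64bit)"/></rpm:requires>
    </format>
  </package>
  <package type="rpm">
    <name>example-libs</name>
    <arch>x86_64</arch>
    <version epoch="0" ver="1.2" rel="3.fc24"/>
    <size package="1024" installed="4096" archive="4200"/>
    <location href="Packages/e/example-libs-1.2-3.fc24.x86_64.rpm"/>
    <format>
      <rpm:provides><rpm:entry name="libexample.so.1()(64bit)"/></rpm:provides>
    </format>
  </package>
  <package type="rpm">
    <name>unrelated</name>
    <arch>noarch</arch>
    <version epoch="0" ver="0.1" rel="1.fc24"/>
    <size package="512" installed="600" archive="700"/>
    <location href="Packages/u/unrelated-0.1-1.fc24.noarch.rpm"/>
    <format>
      <rpm:provides><rpm:entry name="unrelated"/></rpm:provides>
    </format>
  </package>
</metadata>
"""


def load_packages(primary_xml):
    """Index packages by name with their sizes, provides, and requires."""
    packages = {}
    for pkg in ET.fromstring(primary_xml).findall("common:package", NS):
        size = pkg.find("common:size", NS)
        fmt = pkg.find("common:format", NS)
        packages[pkg.findtext("common:name", namespaces=NS)] = {
            "download": int(size.get("package")),
            "installed": int(size.get("installed")),
            "provides": {e.get("name") for e in fmt.findall("rpm:provides/rpm:entry", NS)},
            "requires": {e.get("name") for e in fmt.findall("rpm:requires/rpm:entry", NS)},
        }
    return packages


def install_set(name, packages):
    """Return the package names needed to install `name`, dependencies included."""
    needed, queue = set(), [name]
    while queue:
        current = queue.pop()
        if current in needed:
            continue
        needed.add(current)
        for requirement in packages[current]["requires"]:
            # Naive: take the first package that provides the requirement.
            for candidate, info in packages.items():
                if requirement in info["provides"]:
                    queue.append(candidate)
                    break
    return needed


def totals(names, packages):
    """Return (bytes to download, bytes used once installed) for the given packages."""
    download = sum(packages[n]["download"] for n in names)
    installed = sum(packages[n]["installed"] for n in names)
    return download, installed


pkgs = load_packages(SAMPLE_PRIMARY)
to_install = install_set("example-app", pkgs)
print(sorted(to_install), totals(to_install, pkgs))
```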
filelists.xml
The filelists.xml metadata does exactly what the name implies. It is a list of
every single file contained in each RPM package. Like the primary.xml file, it
contains a list of package elements (which reference packages from the
primary.xml file), within which there are a number of file elements, as well
as a version element that identifies the package version. Files that are
directories have a type=dir attribute.
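The sketch below shows one way this metadata could be used to answer "which package owns this path?". The sample document is hypothetical, and the namespace and the pkgid, name, and arch attributes on the package element are assumptions based on typical createrepo_c output rather than details given above.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for filelists.xml.
FILELISTS_NS = {"fl": "http://linux.duke.edu/metadata/filelists"}

# Hypothetical, shortened filelists.xml.
SAMPLE_FILELISTS = """<filelists xmlns="http://linux.duke.edu/metadata/filelists" packages="1">
  <package pkgid="cccc" name="example-app" arch="x86_64">
    <version epoch="0" ver="1.2" rel="3.fc24"/>
    <file>/usr/bin/example</file>
    <file type="dir">/usr/share/example</file>
    <file>/usr/share/example/README</file>
  </package>
</filelists>
"""


def owners_of(path, filelists_xml):
    """Return (package name, entry type) pairs for every package listing `path`."""
    root = ET.fromstring(filelists_xml)
    owners = []
    for pkg in root.findall("fl:package", FILELISTS_NS):
        for entry in pkg.findall("fl:file", FILELISTS_NS):
            if entry.text == path:
                # Entries without a type attribute are treated as regular files here.
                owners.append((pkg.get("name"), entry.get("type", "file")))
    return owners


print(owners_of("/usr/share/example", SAMPLE_FILELISTS))
```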
other.xml
The other.xml file contains… well, other information about each package. It
references each package in much the same way as filelists.xml. At the very
least, it contains changelog elements, where an element exists for each
changelog entry in the spec file used to build the RPM. Typically this is
truncated, often to the 10 most recent releases.
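As an illustration, this Python sketch extracts the changelog of one package from a hypothetical other.xml. It assumes that each changelog element carries author and date attributes, with the date stored as a Unix timestamp. That matches typical createrepo_c output but is not stated in the text above.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Assumed namespace for other.xml.
OTHER_NS = {"other": "http://linux.duke.edu/metadata/other"}

# Hypothetical, shortened other.xml with made-up changelog entries.
SAMPLE_OTHER = """<otherdata xmlns="http://linux.duke.edu/metadata/other" packages="1">
  <package pkgid="cccc" name="example-app" arch="x86_64">
    <version epoch="0" ver="1.2" rel="3.fc24"/>
    <changelog author="Jane Packager - 1.1-1" date="1450000000">- Update to 1.1</changelog>
    <changelog author="Jane Packager - 1.2-3" date="1460000000">- Fix crash on startup</changelog>
  </package>
</otherdata>
"""


def changelog_for(other_xml, package_name):
    """Return (date, author, text) tuples for a package, newest entry first."""
    root = ET.fromstring(other_xml)
    entries = []
    for pkg in root.findall("other:package", OTHER_NS):
        if pkg.get("name") != package_name:
            continue
        for entry in pkg.findall("other:changelog", OTHER_NS):
            # Assumption: the date attribute is seconds since the Unix epoch.
            when = datetime.fromtimestamp(int(entry.get("date")), tz=timezone.utc)
            entries.append((when, entry.get("author"), entry.text))
    return sorted(entries, key=lambda item: item[0], reverse=True)


for when, author, text in changelog_for(SAMPLE_OTHER, "example-app"):
    print(when.date(), author, text)
```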
comps.xml
comps.xml contains, among other things, a list of groups. Each group contains
a description and a list of packages in that group. Packages can be marked as
mandatory, default, or optional, based on the value of the type attribute
on the packagereq element.
comps.xml also contains package environments and categories, which are simply
lists of package groups, as well as langpacks.
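The following sketch shows how the type attribute on packagereq could drive package selection for a group. The sample comps.xml is hypothetical, and the id, name, and packagelist elements are assumptions based on common comps files. Installing mandatory and default packages unless optional ones are requested is a choice made for this example, not a rule taken from the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened comps.xml containing a single group.
SAMPLE_COMPS = """<comps>
  <group>
    <id>example-tools</id>
    <name>Example Tools</name>
    <description>Hypothetical group used for illustration.</description>
    <packagelist>
      <packagereq type="mandatory">example-app</packagereq>
      <packagereq type="default">example-docs</packagereq>
      <packagereq type="optional">example-devel</packagereq>
    </packagelist>
  </group>
</comps>
"""


def group_packages(comps_xml, group_id, include_optional=False):
    """List the packages of a group, filtered by the packagereq type attribute."""
    wanted = {"mandatory", "default"}
    if include_optional:
        wanted.add("optional")
    root = ET.fromstring(comps_xml)
    for group in root.findall("group"):
        if group.findtext("id") == group_id:
            return [
                req.text
                for req in group.findall("packagelist/packagereq")
                if req.get("type") in wanted
            ]
    raise KeyError(group_id)


print(group_packages(SAMPLE_COMPS, "example-tools"))
```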
prestodelta.xml
prestodelta.xml is used to describe the DRPMs a repository contains. A DRPM is
built from two different binary RPMs (a new version and an old version). A
repository can, and often does, contain several DRPMs for various upgrade paths.
For example, there might be a DRPM containing the difference between
firefox-45.0 and firefox-46.0, as well as a DRPM containing the difference
between firefox-45.1 and firefox-46.0. A client must retrieve the correct DRPM
for the version of a package it currently has installed to apply the DRPM.
The prestodelta root element contains zero or more newpackage elements. Each
newpackage element has name, epoch, version, release, and arch attributes to
identify what the new version of the package is.
Each newpackage element contains one or more delta elements. The delta element
has the oldepoch, oldversion, and oldrelease attributes to identify which old
version of the package the DRPM applies to.
Each delta element contains four elements: filename, sequence, size, and
checksum.
For example:
<?xml version="1.0" encoding="UTF-8"?>
<prestodelta>
  <newpackage name="cmake-fedora" epoch="0" version="2.6.0" release="1.fc23" arch="noarch">
    <delta oldepoch="0" oldversion="2.3.4" oldrelease="2.fc23">
      <filename>drpms/cmake-fedora-2.3.4-2.fc23_2.6.0-1.fc23.noarch.drpm</filename>
      <sequence>cmake-fedora-2.3.4-2.fc23-84bdd3315d4caddf8245e82cb83de4e301d5</sequence>
      <size>51194</size>
      <checksum type="sha256">6926544188f70d0e9dbedfd07fcf361d6fdc813d2888f5635fd647069bcc14ed</checksum>
    </delta>
    <delta oldepoch="0" oldversion="2.5.1" oldrelease="1.fc23">
      <filename>drpms/cmake-fedora-2.5.1-1.fc23_2.6.0-1.fc23.noarch.drpm</filename>
      <sequence>cmake-fedora-2.5.1-1.fc23-9930049f7b6f6c78a7732f5230c38f6e0196</sequence>
      <size>34154</size>
      <checksum type="sha256">45012a502babf1bdda402c05b50c1c68f8c5dbe62d85ce61a0a41c71c0ec6f8c</checksum>
    </delta>
  </newpackage>
</prestodelta>
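A client has to pick the delta that matches the version it has installed. The Python sketch below illustrates that lookup against a shortened copy of the cmake-fedora example (sequence and checksum elements omitted). The function name and the plain string comparison of epoch, version, and release are choices made for this illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the cmake-fedora example; sequence and checksum are omitted.
SAMPLE_PRESTODELTA = """<prestodelta>
  <newpackage name="cmake-fedora" epoch="0" version="2.6.0" release="1.fc23" arch="noarch">
    <delta oldepoch="0" oldversion="2.3.4" oldrelease="2.fc23">
      <filename>drpms/cmake-fedora-2.3.4-2.fc23_2.6.0-1.fc23.noarch.drpm</filename>
      <size>51194</size>
    </delta>
    <delta oldepoch="0" oldversion="2.5.1" oldrelease="1.fc23">
      <filename>drpms/cmake-fedora-2.5.1-1.fc23_2.6.0-1.fc23.noarch.drpm</filename>
      <size>34154</size>
    </delta>
  </newpackage>
</prestodelta>
"""


def find_delta(prestodelta_xml, name, arch, installed):
    """Find the DRPM that upgrades the installed (epoch, version, release).

    Returns (filename, size) or None when no delta matches, in which case a
    client would fall back to downloading the full binary RPM.
    """
    root = ET.fromstring(prestodelta_xml)
    for new in root.findall("newpackage"):
        if new.get("name") != name or new.get("arch") != arch:
            continue
        for delta in new.findall("delta"):
            old = (delta.get("oldepoch"), delta.get("oldversion"), delta.get("oldrelease"))
            if old == installed:
                return delta.findtext("filename"), int(delta.findtext("size"))
    return None


print(find_delta(SAMPLE_PRESTODELTA, "cmake-fedora", "noarch", ("0", "2.5.1", "1.fc23")))
```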
updateinfo.xml
updateinfo.xml describes errata. An erratum describes a change in an RPM
repository. Errata are typically divided into three categories: security,
bugfix, and enhancement. If a package is being updated to fix a security
problem, the erratum for that update is a security erratum. If it simply fixes a
bug with no (known) security implications, it is a bugfix erratum. Finally, the
update could be to provide additional features, in which case it is an
enhancement erratum.
In Fedora, the updateinfo.xml metadata is generated by
Bodhi. It is created when an update
is pushed by Bodhi and injected into the RPM repository metadata using the
modifyrepo_c tool, part of the createrepo_c package.
What an erratum references varies from project to project and product to product. For
example, Red Hat Enterprise Linux and CentOS issue an erratum per component
(SRPM package). However, other projects and products might issue a single
erratum for many components at once. Therefore, an erratum references a list of
one or more RPM packages since one SRPM can produce many RPM packages.
Each erratum has a pkglist element, which contains a collection element, which
contains a name element and one or more package elements. Each package element
has name, version, release, epoch, and arch attributes to identify the affected
package. In addition to those attributes, there is a src attribute. In RHEL
errata, this appears to be the name of the SRPM:
<package name="java-1.7.0-openjdk" version="1.7.0.55" release="2.4.7.2.el7_0" epoch="1" arch="x86_64" src="java-1.7.0-openjdk-1.7.0.55-2.4.7.2.el7_0.src.rpm">
However, in Fedora this src field references where the package is located by URL:
<package name="opendnssec" version="1.4.9" release="1.fc23" epoch="0" arch="i686" src="https://download.fedoraproject.org/pub/fedora/linux/updates/23/i386/o/opendnssec-1.4.9-1.fc23.i686.rpm">
Each package element contains a filename element, and in RHEL errata, a sum
element.
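To show how a client might consume this metadata, here is a small Python sketch that finds the errata affecting a given package. The sample is hypothetical. The updates root element, the update element with its type attribute, and the id element are assumptions based on commonly seen updateinfo.xml files. Only the pkglist, collection, and package structure is taken from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened updateinfo.xml with two made-up errata.
SAMPLE_UPDATEINFO = """<updates>
  <update type="security">
    <id>EXAMPLE-2016-0001</id>
    <title>example-app security update</title>
    <pkglist>
      <collection>
        <name>Example Collection</name>
        <package name="example-app" version="1.2" release="4.fc23" epoch="0"
                 arch="x86_64" src="example-app-1.2-4.fc23.src.rpm">
          <filename>example-app-1.2-4.fc23.x86_64.rpm</filename>
        </package>
      </collection>
    </pkglist>
  </update>
  <update type="bugfix">
    <id>EXAMPLE-2016-0002</id>
    <title>example-app bug fix update</title>
    <pkglist>
      <collection>
        <name>Example Collection</name>
        <package name="example-app" version="1.2" release="5.fc23" epoch="0"
                 arch="x86_64" src="example-app-1.2-5.fc23.src.rpm">
          <filename>example-app-1.2-5.fc23.x86_64.rpm</filename>
        </package>
      </collection>
    </pkglist>
  </update>
</updates>
"""


def errata_for(updateinfo_xml, package_name, kind=None):
    """Return the ids of errata listing `package_name`, optionally filtered by type."""
    root = ET.fromstring(updateinfo_xml)
    matches = []
    for update in root.findall("update"):
        if kind is not None and update.get("type") != kind:
            continue
        names = {pkg.get("name") for pkg in update.findall("pkglist/collection/package")}
        if package_name in names:
            matches.append(update.findtext("id"))
    return matches


print(errata_for(SAMPLE_UPDATEINFO, "example-app", kind="security"))
```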
Organizing RPM Builds
As you now know from the RPM section, each package requires a source tarball and
a spec file. In addition to these two required files, a packager may create
patch files that alter the source code in some way. This is done for many
reasons, but generally it is done to work around a bug in the upstream project,
back-port a bugfix from upstream, or unbundle libraries. All this can become
unwieldy to manage and track, especially when dealing with thousands of packages
(Fedora contains ~18,000 packages). Fedora uses
dist-git to solve this problem.
dist-git is designed specifically to manage RPM package sources. It stores the
spec file, patches, and a reference to the source tarball in a git repository.
The source tarball itself is not checked into Git and instead lives in a
lookaside cache. The validity of the source tarball is determined by the
reference checked into the git repository. Each package is contained in its own
dist-git repository. This allows package maintainers to collaborate and view the
history of a package.
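The idea of validating a tarball against a reference stored in Git can be sketched as a checksum comparison. The example below is an assumption-heavy illustration: it supposes that the reference is a line of the form "SHA512 (filename) = hexdigest" in a file tracked by Git. The file contents, tarball name, and function names are all hypothetical.

```python
import hashlib
import re

# Assumed reference format: "ALGO (filename) = hexdigest", one entry per line.
SOURCES_LINE = re.compile(r"^(?P<algo>[A-Z0-9]+) \((?P<name>.+)\) = (?P<digest>[0-9a-f]+)$")


def parse_sources(text):
    """Map each tarball name to its (hash algorithm, expected digest)."""
    entries = {}
    for line in text.splitlines():
        match = SOURCES_LINE.match(line.strip())
        if match:
            entries[match["name"]] = (match["algo"].lower(), match["digest"])
    return entries


def is_valid(name, data, entries):
    """Check tarball bytes fetched from the lookaside cache against the reference."""
    algo, expected = entries[name]
    return hashlib.new(algo, data).hexdigest() == expected


# Hypothetical tarball contents and the reference that would be checked into Git.
TARBALL = b"pretend these bytes are example-1.0.tar.gz"
SOURCES = f"SHA512 (example-1.0.tar.gz) = {hashlib.sha512(TARBALL).hexdigest()}\n"

entries = parse_sources(SOURCES)
print(is_valid("example-1.0.tar.gz", TARBALL, entries))
```

Keeping only the small reference in Git means the repository history stays light while any change to the tarball remains detectable.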
Of course, having the sources, patches, and spec files organized doesn’t help
much if the RPMs have to be built manually.
Koji (and to some extent Copr) is a tool to build and track SRPMs and RPMs from
those dist-git repositories. It performs the builds in clean, secure
environments for many different architectures by using Mock. Each build can be
tagged to help track where each build ends up. This is helpful when we want to
turn a collection of packages into an operating system distribution. Examples
of tags are f24, f24-updates, and f24-updates-candidate.
Composes
Having all the packages built and tracked in a tool like Koji is only helpful if
there are tools to turn those packages into useful, consumable content. What is
useful content?
- RPM repositories from which packages can be installed
- Installation media (ISOs for CD/DVD, PXE boot images, USB boot images, etc)
- Arbitrary additional files such as release notes, licenses, EULA, GPG keys,
and branding images.
Fedora and RHEL have the concept of a compose. A set of packages makes up a
product release (Fedora 24, for example). The set of packages used in a compose
can be controlled by the tag a package has in Koji.
As a release is developed, new packages are added and current packages are
updated or removed. A compose is an immutable snapshot at a certain point in
time of a product release’s development. At some point, the compose is deemed to
be “gold” and becomes the GA release of a product. For example, Fedora 23 is a
release of the Fedora product.
A compose contains one or more variants. A variant is a particular subset of
the set of packages used in the compose. One subset might target servers,
another workstations, and another Atomic hosts. Each variant is built for one
or more architectures (i686, x86_64, sparc, ppc64, etc). Each variant build for
a specific architecture is referred to as a tree. A tree is made up of:
- One or more RPM repositories
- Bootable ISO images
- PXE boot images including EFI boot files, ISOLINUX boot files, and one or
more kernel images with initial RAM disks.
Almost all the content in a tree is described in a metadata file called the
treeinfo file (sometimes .treeinfo), which is located in the root of the tree
directory. This metadata file can be parsed using the Red Hat Release
Engineering tool, productmd.
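For a quick look at what such a file holds, the sketch below reads a hypothetical treeinfo file with the Python standard library instead of productmd. It assumes the file is INI-style, with a general section and an images section named after the architecture. The section names, keys, and values shown here are illustrative and may differ between releases, so productmd remains the better choice for real use.

```python
import configparser

# Hypothetical treeinfo contents. Real files contain more sections and keys.
SAMPLE_TREEINFO = """
[general]
family = Example Linux
version = 1.0
variant = Server
arch = x86_64

[images-x86_64]
kernel = images/pxeboot/vmlinuz
initrd = images/pxeboot/initrd.img
boot.iso = images/boot.iso
"""


def boot_files(treeinfo_text):
    """Return the image paths listed for the architecture of the tree."""
    parser = configparser.ConfigParser()
    parser.read_string(treeinfo_text)
    arch = parser["general"]["arch"]
    # Assumption: images are listed in a section named "images-<arch>".
    return dict(parser[f"images-{arch}"])


print(boot_files(SAMPLE_TREEINFO))
```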
To summarize, a compose is made up of variants, which are made of
architecture-specific trees.
The tool used by Fedora to create composes is called Pungi. Pungi makes use of
the Lorax project to build each tree. Prior to the Lorax project, trees were
generated by scripts in the Anaconda installer. These scripts have been removed
since Lorax replaces them.
As a concrete example, the Red Hat Enterprise Linux 6.7 (release) Server
(variant) x86_64 tree contains the following:
- The RPM repository (metadata in repodata/)
- Several addon RPM repositories (metadata in HighAvailability/repodata/,
LoadBalancer/repodata/, ResilientStorage/repodata/, and
ScalableFileSystem/repodata/)
- EFI/BOOT/BOOTX64.conf: EFI configuration containing references to the kernel
and initrd in images/pxeboot/
- EFI/BOOT/BOOTX64.efi: EFI boot file for x86_64 architecture
- EFI/BOOT/splash.xpm.gz: boot splash screen graphic
- images/efiboot.img: CD/DVD boot image for EFI systems
- images/efidisk.img: USB boot image for EFI systems (can be dd’ed to a USB
flash drive)
- images/boot.iso: Bootable ISO image built from the various images in
images/, EFI/, isolinux/
- images/install.img: Stage 2 Installation image, loaded when you start the
installation from a supported boot method.
- images/product.img: RHEL product description information used in the
installer
- images/pxeboot/initrd.img: Initial ramdisk file for PXE-capable systems
- images/pxeboot/vmlinuz: Kernel image for PXE-capable systems
- isolinux/: bootloader with configuration, as well as a kernel image, initial
RAM disk, and memtest.
In the above example, the EFI/ and isolinux/ directories are not referenced
by metadata as they are not required by any client.
Updates
Composes are immutable, and when a product is released, it does not change.
Updates are provided in the form of errata.
When a package is updated, an erratum must be associated with it. An erratum is
metadata about the update of one or more packages, very much like the erratum for
a book. These are described in the updateinfo.xml file in an RPM repository. In
the case of Fedora and RHEL, the RPM repositories in the compose are kept pristine
and unchanged, but this is not enforced by the tooling; it is merely convention.
There may be distributions out there that add their errata and updated RPM packages
to the GA compose.
Fedora provides an excellent example of this method. When a release is made,
it is located under the releases/ directory on the mirrors. For example, a
Fedora release lives in releases/<release-version>/<variant>/<arch>/. This
repository remains unchanged, even after updates are released for Fedora. You’ll
notice that the repodata directories contain no updateinfo.xml. Updates are
provided under updates/<release-version>/<arch>/. This RPM repository does
contain updateinfo.xml, which holds the errata for all the packages in this repository.
Red Hat Enterprise Linux is similar, except that releases are usually stored in
the rhel/<variant>/<major-release>/<minor-release>/<arch>/kickstart/
repository. Updates are provided in the
rhel/<variant>/<major-release>/<minor-release>/<arch>/os/ repository.
Overview of the Fedora Build Process
To get an idea of how this works in practice, the Fedora build process is outlined
below. The Fedora Release Engineering team has written documentation on their release
process which may be helpful to reference.
The basic workflow is as follows:
1. Packages are created from upstream repositories (Git repositories, PyPI,
RubyGems, etc.) by creating a spec file and any necessary patches. These go
through a review process. Once approved, a dist-git repository is created for
the package and the spec file and patches are checked in. The source tarball
is uploaded to a lookaside cache (it is not checked into source control, but
a method of verifying the tarball is).
2. Packages are built in Koji by the package maintainer. Each build is made for
a Koji build target. A build target specifies where a package should be built
and how it should be tagged afterwards. This allows target names to remain
fixed as tags change through releases.
3. Products are composed using Pungi. This creates ISOs and other installation
media, boot images for PXE, etc.
4. At a certain point in the release cycle, Fedora’s Bodhi is turned on. After a
package is built (step 2), the package maintainer submits the build to Bodhi.
It is available for testing in the updates-testing repository, and community
members can +1 or -1 updates. After a certain period of time or enough +1
votes, the package is approved. It is pushed into the updates repository with
an entry in the updateinfo.xml metadata file.