Jan Kiszka [Tue, 12 Aug 2014 07:40:23 +0000 (09:40 +0200)]
configs/x86: Add check for VT-d existence
Add a check that validates the existence of configured VT-d units.
Require that every config contains at least one unit. Experimental VT-d
emulation, including interrupt remapping support, is now available for
QEMU. So we can set the config fields accordingly. If VT-d is disabled
or a QEMU version without that support is used, the hypervisor continues
to ignore this error.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
This fills vtd_map_interrupt with the missing logic to set entries in
the interrupt remapping table and programs the hardware to use that
table, blocking all non-remappable interrupts.
vtd_map_interrupt performs the required entry validation and rejects any
improper settings. There is one exception: CPU masks in logical
destination mode will silently be adjusted instead of failing the
service call. The reason is that Linux may have these masks set for
practically inactive interrupt vectors programmed (e.g. enabled MSI
vectors of PCIe ports without any port service using them) and will only
update them when they may be used again. Failing such settings would
stop Linux prematurely.
We reserve remapping table entries for the IOAPIC upon cell creation. We
also reserve as many entries as MSI vectors are reported for each PCI
device of a cell. IOAPIC reservation happens only once, and the region
is kept over the lifetime of the hypervisor as the IOAPIC may be shared
between cells (though this is not recommended). PCI device reservations
are released again when the device is removed from its owning cell.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sun, 27 Jul 2014 12:42:56 +0000 (14:42 +0200)]
core: Virtualize legacy MSI for interrupt remapping support
Analogously to edge-triggered IOAPIC interrupts, hand over all legacy
MSIs by disabling them first, programming the VT-d remapping table and
then writing remappable parameters into the MSI capability registers.
An additional triggering of active vectors ensures that we do not lose
events during handover.
Disabling is done on x86 via a trick: we program an empty CPU mask in
logical destination mode.
MSI-X remains on the to-do list. Thus, once interrupt remapping is
enabled, systems that use MSI-X will be unsupported for the time
being.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sat, 7 Jun 2014 12:26:34 +0000 (14:26 +0200)]
x86: Virtualize IOAPIC redir table for interrupt remapping support
Hand over the IOAPIC on hypervisor setup by first masking all pins, then
requesting interrupt remapping from VT-d and finally reprogramming them
according to the index that VT-d reported. If vtd_map_interrupt returns
-ENOSYS (unconditionally right now due to a missing implementation, later
on when running in QEMU), we continue to write the redirection table
entry unmodified.
As we may lose edge-triggered interrupts while they are masked, we inject
them unconditionally into the target CPU. This may cause one spurious
interrupt per handover to/from the hypervisor, but Linux can deal with
that.
Whenever a cell is added or removed, we rewrite all root cell IOAPIC
redir table entries in order to validate them regarding the target CPU
masks.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Fri, 8 Aug 2014 07:23:50 +0000 (09:23 +0200)]
core: Report root cell as added/removed on initial config commit
This aligns the initial config commit with those performed on non-root
cell creation and destruction. We only need to avoid some then redundant
operations on the root cell in x86's arch_config_commit and
vtd_config_commit.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Tue, 1 Jul 2014 16:56:50 +0000 (18:56 +0200)]
x86: Factor out vtd_update_gcmd_reg
This encapsulates changes to the global command register that preserve
the current state and wait for the state change to become effective.
Both setting and clearing are supported.
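The read-modify-write-and-wait pattern described here can be sketched as follows. This is a minimal illustration against a simulated register pair, not the hypervisor's code: `struct fake_vtd_unit`, `read_gsts`, `write_gcmd` and `update_gcmd_reg` are hypothetical names, and the three-read latency is an arbitrary stand-in for hardware delay. Real hardware also has one-shot command bits, which this sketch does not model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VTD_GCMD_TE   (1u << 31)  /* translation enable */
#define VTD_GCMD_IRE  (1u << 25)  /* interrupt remapping enable */

/* Simulated unit (assumption): a command becomes visible in the status
 * register only after a few status reads. */
struct fake_vtd_unit {
    uint32_t gsts;       /* status currently visible */
    uint32_t pending;    /* last command written */
    unsigned int delay;  /* status reads left until pending takes effect */
};

static void write_gcmd(struct fake_vtd_unit *unit, uint32_t value)
{
    unit->pending = value;
    unit->delay = 3;
}

static uint32_t read_gsts(struct fake_vtd_unit *unit)
{
    if (unit->delay > 0 && --unit->delay == 0)
        unit->gsts = unit->pending;
    return unit->gsts;
}

/* Set or clear the bits in mask while preserving all other state, then
 * wait until the status register reflects the change. */
static void update_gcmd_reg(struct fake_vtd_unit *unit, uint32_t mask, bool set)
{
    uint32_t value = read_gsts(unit);

    assert(mask != 0);
    if (set)
        value |= mask;
    else
        value &= ~mask;
    write_gcmd(unit, value);

    while ((read_gsts(unit) & mask) != (set ? mask : 0))
        ; /* a real implementation would relax the CPU here */
}
```

Keeping the wait inside the helper means callers cannot forget it, which is presumably the point of factoring the function out.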
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Thu, 7 Aug 2014 06:25:17 +0000 (08:25 +0200)]
configs/tools: Introduce device and interrupt number limit
VT-d interrupt remapping but also (one day) AMD IOMMUs require us to
dimension related tables during setup. Introduce two parameters to the
config file: one sets an upper limit on interrupts (all types) that the
system may have to control (for all cells), and another one does so for
devices in the system. The former will be used for VT-d; the latter should
one day be helpful for AMD support, thus it can remain unset in Intel configs.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Thu, 21 Aug 2014 17:30:39 +0000 (19:30 +0200)]
tools: config-create: Remove write restriction for ACPI regions
We don't depend on ACPI in the hypervisor anymore. Also, the NVS memory
may contain some semaphore the kernel updates to synchronize with the
chipset-embedded controller. So let's give Linux full access to this
memory again.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
tools: config-create: Initial support for non-Intel boards
DMAR is Intel-specific. For AMD, we need to parse the IVRS table, but
this code is currently missing since there is no IOMMU support in
the AMD-V port.
Initially, we just make DMAR optional and ignore possible errors
on AMD (Intel will still fail at signature check stage if DMAR is
missing, however the error message will be somewhat misleading).
The template was also modified so missing bits of information are
not included (and IOAPIC ID is set to zero).
Signed-off-by: Valentine Sinitsyn <valentine.sinitsyn@gmail.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
We do not support complex device scope paths yet, so code this into the
parsing function. This also allows simplifying the call sites because
parse_dmar_devscope will now read as many bytes as supported or fail.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Henning Schild [Tue, 12 Aug 2014 14:56:58 +0000 (16:56 +0200)]
tools: config-create: more fine grained parsing of /proc/iomem
Parse /proc/iomem keeping the tree structure. So far we only looked at
the top level entries, mainly at the type strings. But for more fine
grained selection of memory regions we need to look deeper into the
tree.
Signed-off-by: Henning Schild <henning.schild@siemens.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Henning Schild [Thu, 7 Aug 2014 12:45:52 +0000 (14:45 +0200)]
tools: config-create: make sure we are root when reading input files
File access permissions allow users and root to read PCI config spaces
of devices via sysfs. But depending on your uid you will read different
content.
That is why we have to make sure we open those files as root and cannot
just rely on getting an EPERM.
Signed-off-by: Henning Schild <henning.schild@siemens.com>
[Jan: removed extra blank line]
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Wed, 20 Aug 2014 17:08:53 +0000 (19:08 +0200)]
x86: Rework APIC_INVALID_ID to CPU_ID_INVALID
APIC_INVALID_ID is actually an invalid logical CPU ID. Rename the
constant and also ensure that no CPU is registered with this ID. That
way we can drop one extra test from apic_send_ipi as this ID will never
be included in any CPU set.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Wed, 20 Aug 2014 17:01:01 +0000 (19:01 +0200)]
core: Introduce and use cell_owns_cpu
This helper combines the check for the maximum CPU number with the
consultation of the CPU set's bitmap. This both helps to make the code
more compact and avoids forgetting the former test in the future.
We keep it inline as it generates quite a bit of boilerplate code
otherwise, and IPI transmission depends on efficient execution.
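A minimal sketch of such a helper, assuming simplified stand-ins for the hypervisor's data structures: the layouts of `struct cpu_set` and `struct cell`, the `MAX_CPUS` limit and the `test_bit` helper are illustrative assumptions. Only the name `cell_owns_cpu` comes from the commit.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MAX_CPUS       256
#define BITS_PER_LONG  (sizeof(unsigned long) * CHAR_BIT)

static_assert(MAX_CPUS % BITS_PER_LONG == 0, "bitmap must cover MAX_CPUS exactly");

/* Hypothetical, simplified structures */
struct cpu_set {
    unsigned long max_cpu_id;
    unsigned long bitmap[MAX_CPUS / BITS_PER_LONG];
};

struct cell {
    struct cpu_set *cpu_set;
};

static inline bool test_bit(unsigned long nr, const unsigned long *bitmap)
{
    return (bitmap[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1UL;
}

static inline bool cell_owns_cpu(const struct cell *cell, unsigned long cpu_id)
{
    /* Range check first, so the bitmap is never indexed out of bounds. */
    return cpu_id <= cell->cpu_set->max_cpu_id &&
           test_bit(cpu_id, cell->cpu_set->bitmap);
}
```

Because the range check short-circuits, a caller passing an arbitrary guest-supplied CPU ID cannot trigger an out-of-bounds bitmap read.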
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Henning Schild [Tue, 5 Aug 2014 12:08:50 +0000 (14:08 +0200)]
tools: config-create: locate templates next to script or in given dir
When the script was called from outside the tools dir the templates
could not be found.
This patch assumes the templates are next to the script, one can also
specify a path.
Signed-off-by: Henning Schild <henning.schild@siemens.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Mon, 4 Aug 2014 14:35:20 +0000 (16:35 +0200)]
core: Adjust max_cpu_id calculation in cell_init
No need to set the limit to what we can hold in the bitmap. The limit is
better defined by the cpu set size in the config file. This not only
simplifies the code but also shortens cpu set iterations.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Mon, 4 Aug 2014 12:16:49 +0000 (14:16 +0200)]
core: Enforce online CPUs == expected CPUs
No longer allow the hypervisor to start with fewer than the number of
expected CPUs. This prevents us from leaving (known) CPUs under native
control after the hypervisor started.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Fri, 1 Aug 2014 15:23:26 +0000 (17:23 +0200)]
x86: Use union for MSI address & data encoding/decoding
We will work more intensively with MSIs. Encoding the address and data
word fields as a union with bit fields will simplify those tasks. For
now we can already exploit this in vtd_init_fault_nmi.
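To illustrate the idea, here is a sketch of such unions for the non-remapped x86 MSI format. The type and field names are hypothetical, not those used in the hypervisor. Bit-field ordering is implementation-defined in C; the layout below assumes the LSB-first allocation that GCC and Clang use on little-endian x86.

```c
#include <assert.h>
#include <stdint.h>

/* MSI address word (names are illustrative assumptions) */
union msi_address {
    struct {
        uint32_t reserved0:2;
        uint32_t dest_logical:1;   /* bit 2: logical destination mode */
        uint32_t redir_hint:1;     /* bit 3: redirection hint */
        uint32_t reserved1:8;
        uint32_t destination:8;    /* bits 12-19: APIC ID or logical mask */
        uint32_t base:12;          /* bits 20-31: 0xfee on x86 */
    } f;
    uint32_t raw;
};

/* MSI data word */
union msi_data {
    struct {
        uint32_t vector:8;         /* bits 0-7 */
        uint32_t delivery_mode:3;  /* bits 8-10 */
        uint32_t reserved0:3;
        uint32_t level:1;          /* bit 14 */
        uint32_t trigger_mode:1;   /* bit 15 */
        uint32_t reserved1:16;
    } f;
    uint32_t raw;
};

static_assert(sizeof(union msi_address) == 4, "address word must be 32 bits");
static_assert(sizeof(union msi_data) == 4, "data word must be 32 bits");
```

With this encoding, a decoder reads a register into `raw` and then accesses named fields instead of open-coding shifts and masks.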
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Wed, 30 Jul 2014 18:54:38 +0000 (20:54 +0200)]
core/configs/tools: Remove ACPI support from hypervisor
There is no more user of the ACPI table lookup. Remove this code as well
as the config memory region in the configuration files. Update the
config generator accordingly.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Wed, 30 Jul 2014 16:15:43 +0000 (18:15 +0200)]
configs/tools: Describe DMAR units in config files
This prepares to switch from ACPI parsing to config file based DMAR unit
discovery. For simplicity reasons, we limit the number of supported DMAR
units to 8. Can be extended or made dynamic when needed.
Update the h87i config accordingly.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Wed, 30 Jul 2014 15:15:46 +0000 (17:15 +0200)]
configs: Require Q35 machine model for QEMU-based test setup
With the introduction of config-based MMCONFIG parameters, it becomes
impossible to have one QEMU config for both its PC machines. Restrict
ourselves to the one that will soon gain VT-d support: Q35.
Update README to reflect these requirements and changes.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Mon, 4 Aug 2014 08:41:52 +0000 (10:41 +0200)]
core: Disable PCI devices on removal
Switch off any bus master, MMIO and PIO dispatching when removing a
device from a cell. Also try to suppress INTx signals (not all devices
may respect this).
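In terms of the standard PCI command register, this amounts to clearing the I/O space, memory space and bus master enable bits and setting the INTx disable bit. A minimal sketch of that computation follows; the macro and function names are hypothetical, and the actual config space access is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Bits of the PCI command register (config space offset 0x04) */
#define PCI_CMD_IO        (1u << 0)   /* respond to I/O space accesses */
#define PCI_CMD_MEM       (1u << 1)   /* respond to memory space accesses */
#define PCI_CMD_MASTER    (1u << 2)   /* allow bus mastering (DMA) */
#define PCI_CMD_INTX_OFF  (1u << 10)  /* suppress INTx assertion */

static_assert((PCI_CMD_INTX_OFF & (PCI_CMD_IO | PCI_CMD_MEM | PCI_CMD_MASTER)) == 0,
              "set and cleared bits must not overlap");

/* Compute the command register value for a device that leaves its cell. */
static uint16_t pci_cmd_for_removed_device(uint16_t cmd)
{
    cmd &= (uint16_t)~(PCI_CMD_IO | PCI_CMD_MEM | PCI_CMD_MASTER);
    cmd |= PCI_CMD_INTX_OFF;
    return cmd;
}
```

All other bits of the register are preserved, so unrelated settings such as SERR# reporting stay as they were.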
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Mon, 4 Aug 2014 06:29:26 +0000 (08:29 +0200)]
core: Fix PCI device runtime ownership tracking
Trigger PCI device addition and removal from the PCI core and update the
cell field in the device state in order to track active ownership. The
vtd module now only provides callbacks to update its tables when adding
or removing a device.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Tue, 29 Jul 2014 18:34:24 +0000 (20:34 +0200)]
core: Introduce PCI device state
We will have to store a number of runtime state items for PCI
devices, specifically their owner. Allocate these states as an array
during cell creation and release them on cell destruction.
We can already use the structure to keep a reference to the cell the
device belongs to. This avoids having to pass this around over multiple
hops. It will also be used soon to encode runtime ownership by setting
or clearing the reference.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Tue, 29 Jul 2014 18:45:23 +0000 (20:45 +0200)]
core: Only perform PCI config space writes on PCI_ACCESS_PERFORM
If we emulate a config space write, we may be able to skip the physical
access completely. To model this, rename PCI_ACCESS_EMULATE to
PCI_ACCESS_DONE which signals to the caller of the moderation functions
that no physical access should be performed.
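The resulting contract between a moderation function and its caller might look like the sketch below. `PCI_ACCESS_PERFORM` and `PCI_ACCESS_DONE` are named in the commit; `PCI_ACCESS_REJECT`, the `fake_device` structure and the policy itself (reject writes to the ID dword, emulate the BARs, pass everything else through) are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum pci_access { PCI_ACCESS_REJECT, PCI_ACCESS_PERFORM, PCI_ACCESS_DONE };

#define CFG_DWORDS 64  /* 256 bytes of conventional config space */

static_assert(CFG_DWORDS * 4 == 256, "sketch covers the conventional space only");

/* Hypothetical device model */
struct fake_device {
    uint32_t hw[CFG_DWORDS];      /* stands in for the physical registers */
    uint32_t shadow[CFG_DWORDS];  /* values kept by the emulation */
};

/* Illustrative policy, not the hypervisor's */
static enum pci_access cfg_write_moderate(struct fake_device *dev,
                                          unsigned int dword, uint32_t value)
{
    if (dword >= CFG_DWORDS || dword == 0)
        return PCI_ACCESS_REJECT;
    if (dword >= 4 && dword <= 9) {   /* BARs at offsets 0x10-0x24 */
        dev->shadow[dword] = value;   /* emulated, hardware untouched */
        return PCI_ACCESS_DONE;
    }
    return PCI_ACCESS_PERFORM;
}

/* Caller: touches the hardware only on PCI_ACCESS_PERFORM. */
static int cfg_write(struct fake_device *dev, unsigned int dword, uint32_t value)
{
    switch (cfg_write_moderate(dev, dword, value)) {
    case PCI_ACCESS_PERFORM:
        dev->hw[dword] = value;
        return 0;
    case PCI_ACCESS_DONE:
        return 0;
    default:
        return -1;
    }
}
```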
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sun, 3 Aug 2014 19:11:37 +0000 (21:11 +0200)]
core: Pass value directly to pci_cfg_write_moderate
Convert pass-by-reference to pass-by-value for the value
pci_cfg_write_moderate should handle. Reason: either we will emulate and
write in the context of the moderation, or we let the original value
pass as-is.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sun, 27 Jul 2014 17:38:48 +0000 (19:38 +0200)]
core/configs/tools: Switch PCI configuration format to single BDF value
There is no value in splitting up the PCI device address in the config
format into bus and devfn. Fold them into a single value that can be
matched more easily and can also easily be split up again via new helper macros
whenever needed.
This generates some work for locally maintained config files, sorry.
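The folding follows the usual PCI addressing scheme: bus number in the upper 8 bits, device and function (devfn) in the lower 8. A sketch of what such helper macros could look like is given here; the macro names are assumptions and may differ from those introduced by the commit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers: 16-bit BDF = bus[15:8] | device[7:3] | function[2:0] */
#define PCI_BDF(bus, dev, fn) \
    ((uint16_t)(((bus) << 8) | (((dev) & 0x1f) << 3) | ((fn) & 0x7)))
#define PCI_BUS(bdf)    ((uint8_t)((bdf) >> 8))
#define PCI_DEVFN(bdf)  ((uint8_t)((bdf) & 0xff))
#define PCI_DEV(bdf)    (((bdf) >> 3) & 0x1f)
#define PCI_FN(bdf)     ((bdf) & 0x7)

static_assert(PCI_BUS(PCI_BDF(0x12, 0x1f, 0x7)) == 0x12, "bus must round-trip");
static_assert(PCI_DEVFN(PCI_BDF(0x12, 0x1f, 0x7)) == 0xff, "devfn must round-trip");
```

Under this encoding, device 00:1b.0 becomes the single value 0x00d8, and matching a device is a plain integer comparison.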
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sun, 27 Jul 2014 15:42:24 +0000 (17:42 +0200)]
core: Fix calculation of MMCONFIG region size
The PCI Firmware Specification says: "For PCI-X and PCI Express
platforms utilizing the enhanced configuration access method, the base
address of the memory mapped configuration space always corresponds to
bus number 0 (regardless of the start bus number decoded by the host
bridge) [...]." So drop the start bus from the size calculation.
Moreover, we had an off-by-one regarding end bus to size translation.
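Put together, the region size depends only on the end bus: each bus occupies 32 devices x 8 functions x 4 KiB = 1 MiB, and buses 0 through end_bus inclusive must be covered. A sketch under these assumptions, with a hypothetical function name:

```c
#include <assert.h>
#include <stdint.h>

/* 4 KiB of config space per function, 8 functions per device,
 * 32 devices per bus */
#define MMCONFIG_BYTES_PER_BUS (32ULL * 8 * 4096)

static_assert(MMCONFIG_BYTES_PER_BUS == (1ULL << 20), "one bus spans 1 MiB");

/* The base address always corresponds to bus 0, so the start bus does not
 * enter the calculation. The "+ 1" makes end_bus inclusive. */
static inline uint64_t mmconfig_region_size(uint8_t end_bus)
{
    return ((uint64_t)end_bus + 1) * MMCONFIG_BYTES_PER_BUS;
}
```

For example, an end bus of 0x3f yields 64 MiB, and the maximum end bus of 0xff yields the full 256 MiB.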
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sun, 27 Jul 2014 06:41:25 +0000 (08:41 +0200)]
inmates: pci-demo: Clear STATESTS before triggering the MSI
STATESTS may still have pending events which could prevent an MSI
delivery after controller reset. That reset doesn't clear them, so do
this explicitly.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sun, 3 Aug 2014 17:55:31 +0000 (19:55 +0200)]
x86: Block write access to IA32_APIC_BASE MSR
The hypervisor depends on a consistent APIC mode. So prevent a cell from
messing it up. As the APIC is kept in the same state across cell
can mess it up. As the APIC is kept in the same state across cell
assignments, no cell has a need to change it.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sat, 26 Jul 2014 06:17:46 +0000 (08:17 +0200)]
x86: Filter LVT delivery modes
Do not allow cells to program anything other than Fixed or NMI mode. NMIs
will still be swallowed by the hypervisor NMI interception path, so perf
& Co. remain broken.
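The filter itself is a small check on the delivery mode field, which occupies bits 8-10 of an LVT entry (Fixed is 000b, NMI is 100b). A minimal sketch, with hypothetical macro and function names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define APIC_LVT_DLVR_MASK   (0x7u << 8)
#define APIC_LVT_DLVR_FIXED  (0x0u << 8)
#define APIC_LVT_DLVR_NMI    (0x4u << 8)

static_assert((APIC_LVT_DLVR_NMI & ~APIC_LVT_DLVR_MASK) == 0,
              "mode encodings must lie within the mask");

/* Returns true if a cell may write this value to an LVT register. */
static bool lvt_delivery_mode_allowed(uint32_t lvt)
{
    uint32_t mode = lvt & APIC_LVT_DLVR_MASK;

    return mode == APIC_LVT_DLVR_FIXED || mode == APIC_LVT_DLVR_NMI;
}
```

This rejects SMI (010b), INIT (101b) and ExtINT (111b), all of which could interfere with the hypervisor's control of the CPU.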
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sat, 26 Jul 2014 05:45:22 +0000 (07:45 +0200)]
x86: Filter writes to reserved APIC register bits
Set up a bitmap for all xAPIC/x2APIC registers that marks reserved bits
(or complete registers). Check to-be-written values against this bitmap
before executing accesses in hypervisor context.
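A sketch of such a check follows. To stay within standard C, it stores the complement (the writable bits) so that registers missing from the table default to fully reserved; this is a choice made for the example, not necessarily how the hypervisor lays out its table. The two entries are illustrative, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define APIC_NUM_REGS  0x40  /* register index = xAPIC offset >> 4 */
#define APIC_REG_TPR   0x08
#define APIC_REG_SVR   0x0f

/* Writable bits per register; everything else counts as reserved.
 * Unlisted registers are zero-initialized, i.e. fully reserved. */
static const uint32_t apic_writable_bits[APIC_NUM_REGS] = {
    [APIC_REG_TPR] = 0x000000ff,  /* task priority */
    [APIC_REG_SVR] = 0x000013ff,  /* vector, enable, focus, EOI suppression */
};

static_assert(APIC_REG_SVR < APIC_NUM_REGS, "table must cover listed registers");

static bool apic_write_allowed(unsigned int reg, uint32_t value)
{
    if (reg >= APIC_NUM_REGS || apic_writable_bits[reg] == 0)
        return false;  /* unknown or completely reserved register */
    return (value & ~apic_writable_bits[reg]) == 0;
}
```

Performing this check before the hypervisor executes the access keeps a faulting write from being raised in hypervisor context.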
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Thu, 24 Jul 2014 17:16:59 +0000 (19:16 +0200)]
x86: Prevent getting stuck while trying to clear the APIC
If some interrupt source (typically a level-triggered IOAPIC pin)
continuously sends messages to the APIC we are trying to clear from
pending bits in ISR and IRR, we will get stuck in the hypervisor in an
interrupt storm.
Avoid this by limiting the number of handled interrupts to the number of
vectors we have. When reaching this limit, simply raise TPR to break out
of the loop. It's cleared again on exit from apic_clear, and the code
booting the CPU can handle the then pending interrupt itself. That's
almost like real hardware would behave (low-prio IRR bits may remain set
due to a stuck high-prio interrupt). However, only buggy SMP cells will
eventually be able to trigger this; IOAPIC pins of exiting cells will soon be
masked to prevent this scenario.
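The bounded loop can be modeled roughly as follows. The APIC is replaced by a small simulation (`struct fake_apic` and its helpers are assumptions), so this only demonstrates the control flow: acknowledge at most one interrupt per vector, then raise TPR to block further delivery, and clear TPR again on exit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define APIC_NUM_VECTORS 256

/* Simulated APIC state (hypothetical) */
struct fake_apic {
    bool stuck_source;     /* a source that re-raises after every ack */
    unsigned int pending;  /* finite number of queued interrupts */
    uint32_t tpr;          /* task priority; 0xff blocks all delivery */
    unsigned int acked;    /* number of acknowledged interrupts */
};

static bool interrupt_deliverable(const struct fake_apic *apic)
{
    return apic->tpr == 0 && (apic->stuck_source || apic->pending > 0);
}

static void acknowledge(struct fake_apic *apic)
{
    if (apic->pending > 0)
        apic->pending--;
    apic->acked++;
}

/* Returns true if the limit was hit, i.e. an interrupt may still be pending. */
static bool apic_clear_sketch(struct fake_apic *apic)
{
    unsigned int handled = 0;
    bool limit_hit = false;

    while (interrupt_deliverable(apic)) {
        if (handled == APIC_NUM_VECTORS) {
            apic->tpr = 0xff;  /* break the storm by blocking delivery */
            limit_hit = true;
            continue;
        }
        acknowledge(apic);
        handled++;
    }
    assert(handled <= APIC_NUM_VECTORS);
    apic->tpr = 0;  /* cleared again on exit */
    return limit_hit;
}
```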
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Wed, 23 Jul 2014 07:19:27 +0000 (09:19 +0200)]
core: Moderate access to PCI capabilities
Make use of the capability configuration and permit write access only to
explicitly configured capabilities. Read access is harmless as it is
free of side effects.
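A write check of this kind could be sketched as below, assuming each configured capability is described by its ID, start offset, length and a flags word with a write-permission bit. The structure layout, the `PCICAP_WRITE` flag and the function names are hypothetical; handling of the standard header below the first capability is out of scope here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCICAP_WRITE 0x0001  /* hypothetical flag: cell may write */

/* Hypothetical config entry for one capability */
struct pci_capability {
    uint16_t id;     /* capability ID, e.g. 0x05 for MSI */
    uint16_t start;  /* offset in config space */
    uint16_t len;    /* length in bytes */
    uint16_t flags;
};

static_assert(sizeof(struct pci_capability) == 8, "entry must stay compact");

static const struct pci_capability *
find_capability(const struct pci_capability *caps, size_t num_caps,
                uint16_t offset)
{
    for (size_t n = 0; n < num_caps; n++)
        if (offset >= caps[n].start &&
            offset < caps[n].start + caps[n].len)
            return &caps[n];
    return NULL;
}

/* Writes are permitted only inside capabilities explicitly marked writable. */
static bool cap_write_allowed(const struct pci_capability *caps,
                              size_t num_caps, uint16_t offset)
{
    const struct pci_capability *cap = find_capability(caps, num_caps, offset);

    return cap != NULL && (cap->flags & PCICAP_WRITE) != 0;
}
```

Offsets that fall into no configured capability are denied by default, which matches the "only explicitly configured" rule.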
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sun, 20 Jul 2014 09:45:37 +0000 (11:45 +0200)]
tools/configs: Describe PCI capabilities in config files
Instead of parsing the PCI config spaces in the hypervisor, offload this
to the configuration generator. It will produce a (logically) linked
list of capabilities per device, their ID, start and length. Each
capability can furthermore be marked as writable by the cell.
Note that identical capability lists shared between multiple devices
will automatically be folded into a single one. The user can duplicate and
customize them individually in a manual post-processing step.
Configurations are updated for QEMU, the H87i and the pci-demo cell.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sun, 20 Jul 2014 22:13:15 +0000 (00:13 +0200)]
tools: config-create: Simplify optional file handling and generator mode
Stop passing exceptions from input_open to its callers: An optional file
can be returned as empty (derived from /dev/null), same on failures
during collector creation. The typical mode for input_open is 'r', so we
can save passing this explicitly in most cases. Same for optional which
is generally False.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sun, 20 Jul 2014 18:01:10 +0000 (20:01 +0200)]
tools: config-create: Return empty dir list when running in generator mode
No point in returning valid results here. The files in this directory
may not be accessible for normal users, and then the collector
generation will fail prematurely.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Tue, 22 Jul 2014 16:48:19 +0000 (18:48 +0200)]
core: Fix error handling of MMCONFIG setup
If we have an MMCONFIG region, we must either successfully map it or
fail the initialization. Succeeding without setting up pci_space will
cause crashes later on when accessing it on behalf of a cell.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Tue, 22 Jul 2014 16:37:40 +0000 (18:37 +0200)]
x86: Enable unwinding from exception handler
Preserve the .eh_frame section for the linked hypervisor object and
only remove it from the binary. Then add .cfi directives to the
exception entry code. This enables using a debugger to unwind from
the exception handler to the causing function and beyond (not perfect
due to missing stack frames, though).
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Mon, 21 Jul 2014 18:22:08 +0000 (20:22 +0200)]
core: Rework PCI config space access handling
Move more logic into generic code by extending the write handler to
pci_cfg_write_moderate and introducing pci_cfg_read_moderate. These
handlers are responsible for any config space access, including to
unowned or non-existent devices. They can reject the access, return an
emulated value on read or a real value to be written to hardware, or
they instruct the caller to perform the access directly.
We already pass a reference to the issuing cell to the access handlers.
It stays unused for now but will be needed by succeeding changes. So
add it now to avoid changing API and callers once again later on.
This commit lays the foundation for capability access moderation and,
specifically, MSI/MSI-X emulation.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>