Jan Kiszka [Sat, 4 Jul 2015 21:31:27 +0000 (23:31 +0200)]
core: ivshmem: Fix cell disconnection
Move the disconnect call before the potential endpoint copy operation.
Otherwise we risk updating the stale second entry instead of the now
active first one.
This change also ensures that disconnect is performed even for the last
endpoint. This will allow us to put cleanup tasks into that function
that have to be executed unconditionally.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sat, 4 Jul 2015 21:14:30 +0000 (23:14 +0200)]
core: ivshmem: Mark BARs as 64-bit again
Regression of 294110a887: Just as physical devices fill their bar array
during setup, virtual devices need to do this as well. Namely, the
64-bit flag got lost during the migration to generic BAR emulation.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sat, 4 Jul 2015 11:36:53 +0000 (13:36 +0200)]
x86: Prevent usage of MMX, SSE, and AVX by compiler
The compiler may decide to use MMX, SSE or even AVX for copying data or
similar purposes. Prevent this because we neither initialize the related
units nor save/restore their state between the different worlds.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Wed, 1 Jul 2015 05:03:20 +0000 (07:03 +0200)]
x86: Embed page for EPT/NPT root_table into cell structure
Both Intel and AMD need this page and currently allocate it
programmatically. We can save some logic, specifically error handling,
by reserving the page in the cell structure.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
"Get to know" Jailhouse originally appeared in Linux Journal issue
252 (April 2015). As of May 2015, it can be redistributed freely,
so add its slightly updated version to Documentation.
Signed-off-by: Valentine Sinitsyn <valentine.sinitsyn@gmail.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
We are preparing to import yet another article about Jailhouse, so
it makes sense to have a dedicated place to store them. Also, add a
timestamp to the article's filename, so one can easily tell its
publication date.
Signed-off-by: Valentine Sinitsyn <valentine.sinitsyn@gmail.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sun, 24 May 2015 08:39:28 +0000 (10:39 +0200)]
configs: Add a linux-x86-demo cell configuration
This demonstrates non-root Linux booting. It is targeting the QEMU
reference setup but can easily be tailored for physical setups as well.
The config contains an ivshmem device to demonstrate both PCI device
discovery and inter-cell communication. Of the four available CPUs in
the QEMU setup, 3 are assigned to the cell to show that SMP works.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sun, 24 May 2015 08:10:22 +0000 (10:10 +0200)]
tools, inmates: Add "cell linux" subcommand to jailhouse tool
This adds support for loading and booting paravirtualized x86 Linux
kernels in non-root cells. The jailhouse tool is extended for this
purpose with a new subcommand "cell linux" that accepts the cell
configuration, the kernel image and an optional initrd as input. A
kernel command line can be specified as well. The script then creates
the cell, unless it already exists, and loads the kernel, the initrd, a
special boot loader and the required parameters for that loader into the
cell RAM. Finally, it starts the cell.
The interface between the python helper and the boot loader inmate is
based on the kernel's boot_params structure with a custom setup_data
extension. The former is initialized by the python helper, specifically
to inform Linux about the location of its initrd and the command line.
It also contains an e820 list to report the memory layout. The
setup_data is filled by the boot loader with information about the PM
timer address and the available CPUs as well as their physical APIC IDs.
For that purpose, the Linux cell requires a communication region.
Although the loader script is currently x86-only, extension to ARM is
surely feasible as well.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sun, 24 May 2015 08:00:02 +0000 (10:00 +0200)]
inmates: Add infrastructure for inmates that serve as tools
We will add an x86 inmate that will support the booting of Linux in
non-root cells. This lays the foundation for such tools, including their
installation into $(libexecdir)/jailhouse.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Mon, 27 Apr 2015 18:42:13 +0000 (20:42 +0200)]
configs: Extend inmates memory in qemu config
Reduce the hypervisor memory to 6 MB, which is still plenty, so that we
can create more or larger inmates. Reorder and extend the description
accordingly.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sat, 2 May 2015 10:40:05 +0000 (12:40 +0200)]
x86: Ignore writes to the xAPIC ID register
Writing to the APIC ID register is legal in xAPIC mode but is ignored by
recent CPU models. Linux performs a write on boot-up, e.g., and ignoring
this is both cheap and helpful to keep para-virtualization needs low.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Fri, 1 May 2015 11:00:09 +0000 (13:00 +0200)]
x86: Implement standard hypervisor detection protocol
This provides cpuid-based Jailhouse detection conforming to the protocol
also used by other major hypervisors: set bit 31 of ecx for function
0x01, provide a signature via function 0x40000000 and a so far empty
feature set via function 0x40000001.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
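The detection protocol described in the commit above can be modeled from
the guest's point of view. The following is a minimal sketch in Python,
not code from the repository: it only decodes register values that a
cpuid invocation would return. The helper names and the signature string
"Jailhouse" used in the round-trip example are assumptions made for
illustration.

```python
import struct

HYPERVISOR_PRESENT_BIT = 1 << 31   # ecx bit 31 of cpuid function 0x01
SIGNATURE_FUNCTION = 0x40000000    # signature is returned by this function
FEATURE_FUNCTION = 0x40000001      # feature set, empty so far per the commit


def hypervisor_present(leaf1_ecx):
    """Return True if cpuid function 0x01 announces a hypervisor."""
    return bool(leaf1_ecx & HYPERVISOR_PRESENT_BIT)


def decode_signature(ebx, ecx, edx):
    """Reassemble the 12-byte signature from the three result registers."""
    raw = struct.pack("<III", ebx, ecx, edx)
    return raw.rstrip(b"\0").decode("ascii")


def encode_signature(text):
    """Hypothetical helper: pack a signature into (ebx, ecx, edx)."""
    raw = text.encode("ascii").ljust(12, b"\0")
    return struct.unpack("<III", raw)
```

A guest would first test the presence bit and only then query function
0x40000000, comparing the decoded string against the signatures it knows.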
Jan Kiszka [Fri, 1 May 2015 10:12:27 +0000 (12:12 +0200)]
x86: Always intercept cpuid
Refactor vmx_handle_cpuid to vcpu_handle_cpuid and ensure that both VMX
and SVM use it for emulating guest cpuid invocations. That means SVM has
to intercept it now.
We will need this to reliably indicate the presence of Jailhouse to our
inmates.
CC: Valentine Sinitsyn <valentine.sinitsyn@gmail.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Mon, 18 May 2015 06:51:27 +0000 (08:51 +0200)]
core: ivshmem: Use generic BAR emulation
Simplify the code by relying on the PCI core to emulate BAR writes. This
just requires proper settings of the bar_mask fields of ivshmem devices
in configs.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Mon, 18 May 2015 05:22:09 +0000 (07:22 +0200)]
core: ivshmem: Refactor pci_ivshmem_cfg_write
We can simplify this by passing in the bias-shifted row value to be
written and the access byte-mask. Then pci_ivshmem_cfg_write just needs
to combine the new value with those of the other bytes as needed, and we
can drop all the size-specific dispatching.
This also lays the foundation for reusing generic BAR emulation.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
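The combination of a bias-shifted value with the untouched bytes of a
dword row, as described in the commit above, can be sketched as follows.
This is an illustrative Python model, not the hypervisor's C code; the
function names are hypothetical and a 4-bit byte mask (one bit per byte
of the row) is assumed.

```python
def shift_access(offset, size, value):
    """Turn (byte offset, size in bytes, value) into a row-aligned
    value and a 4-bit byte mask."""
    bias = offset % 4
    byte_mask = ((1 << size) - 1) << bias
    return (value << (bias * 8)) & 0xFFFFFFFF, byte_mask


def byte_mask_to_bits(byte_mask):
    """Expand a 4-bit byte mask into a 32-bit bit mask."""
    bits = 0
    for byte in range(4):
        if byte_mask & (1 << byte):
            bits |= 0xFF << (byte * 8)
    return bits


def merge_row(old_row, new_row, byte_mask):
    """Take the selected bytes from new_row, keep the rest of old_row."""
    bits = byte_mask_to_bits(byte_mask)
    return (old_row & ~bits & 0xFFFFFFFF) | (new_row & bits)
```

With this shape, a 1-, 2- or 4-byte write goes through the same merge
step, which is what makes size-specific dispatching unnecessary.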
Jan Kiszka [Sun, 17 May 2015 09:50:04 +0000 (11:50 +0200)]
core: Add basic BAR write emulation for physical PCI devices
This enables cells to explore the size of PCI device resources by
writing 1's to base address registers and then reading back which bits
got modified. We so far didn't support this because Linux in the root
cell already retrieved the sizes before Jailhouse ran and other cells
could have been customized to use preconfigured information.
However, adding this feature only increases the code by a few tens of
lines while making life for preexisting inmate OSes, including Linux,
significantly easier. Moreover, we will save some code again when
switching ivshmem's BAR emulation to this version.
Note that this does NOT allow cells to remap PCI device resources in
their address space. That would require more effort with limited
benefits. Given that we preconfigure all BARs, neither Linux nor other
OSes have a need to change them. Any attempt to do so will simply have
no effect.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
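The sizing procedure that the commit above makes possible can be modeled
as follows. This is a minimal Python sketch under stated assumptions: a
single 32-bit memory BAR, a hypothetical EmulatedBar class, and a
per-BAR mask of modifiable bits. It is not the hypervisor's
implementation; it only shows why writing 1's and reading back reveals
the size while the real mapping stays untouched.

```python
class EmulatedBar:
    """Hypothetical model of one emulated 32-bit memory BAR."""

    def __init__(self, base, bar_mask, type_bits=0x0):
        self.mapped_base = base      # preconfigured mapping, never changes
        self.bar_mask = bar_mask     # address bits a cell may modify
        self.type_bits = type_bits   # read-only low bits (type, prefetch)
        self.shadow = base & bar_mask

    def write(self, value):
        # Only the masked bits are stored; the mapping is not touched.
        self.shadow = value & self.bar_mask

    def read(self):
        return self.shadow | self.type_bits


def probe_size(bar):
    """Size a memory BAR the way an OS typically does."""
    original = bar.read()
    bar.write(0xFFFFFFFF)
    readback = bar.read()
    bar.write(original)              # restore the previous value
    return (~(readback & 0xFFFFFFF0) & 0xFFFFFFFF) + 1
```

For a BAR with mask 0xFFFFF000 the probe yields 0x1000, i.e. a 4 KiB
resource, and mapped_base is the same before and after the probe.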
Jan Kiszka [Sun, 17 May 2015 09:06:50 +0000 (11:06 +0200)]
core, tools: Add BAR masks to jailhouse_pci_device
Add a new field per BAR to the PCI device configuration. It allows
masking the modifiable part of a BAR before storing writes. This will
support BAR write emulation that is required to make PCI resource sizes
explorable by cells.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sun, 17 May 2015 08:47:33 +0000 (10:47 +0200)]
core: pci: Rework config space header write moderation
Switch to a more powerful array-based write access control for the PCI
config space header. The array consists of tuples, each controlling the
access to one dword row. Access can be denied, permitted or emulated as
read-only, thus ignored. As before, a mask selects the bytes of the row
for which the access type applies.
This new model allows us to properly describe which registers of the
bridge header we effectively want to freeze as read-only so that Linux
can rescan buses.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
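The array-based write moderation described in the commit above can be
sketched as a small lookup. This is an illustrative Python model only:
the table contents, the names, and the rule that bytes not covered by an
entry's mask are denied are assumptions, not the actual table used by
the hypervisor.

```python
from enum import Enum


class Access(Enum):
    DENY = "deny"
    PERMIT = "permit"
    RDONLY = "read-only"   # write is accepted but ignored


# One (access type, byte mask) tuple per dword row of the header.
# The assignments below are hypothetical examples.
POLICY = [
    (Access.DENY, 0b0000),     # row 0, offset 0x00
    (Access.PERMIT, 0b0011),   # row 1, offset 0x04, lower two bytes
    (Access.DENY, 0b0000),     # row 2, offset 0x08
    (Access.DENY, 0b0000),     # row 3, offset 0x0c
    (Access.RDONLY, 0b1111),   # row 4, offset 0x10, all bytes
]


def moderate_write(offset, size):
    """Classify a config space write of `size` bytes at `offset`."""
    row = offset // 4
    if row >= len(POLICY):
        return Access.DENY
    access, mask = POLICY[row]
    write_mask = ((1 << size) - 1) << (offset % 4)
    # Assumption: the entry applies only if it covers every written byte.
    if (write_mask & mask) == write_mask:
        return access
    return Access.DENY
```

Expressing the policy as data keeps the handler itself free of
per-register special cases; changing what is frozen means editing the
table only.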
Jan Kiszka [Mon, 18 May 2015 06:49:52 +0000 (08:49 +0200)]
core: ivshmem: Improve error reporting
The warning in ivshmem_update_msix is actually fatal (callers will fail
the CPU when we return an error code), and we need some additional
reporting on MMIO accesses. The latter avoids ending up with just a
register dump and no information about where the problem was detected.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
An ivshmem PCI device always has a valid ivshmem_endpoint pointer; this
is ensured by ivshmem_connect_cell, called during device initialization.
And there is nothing that invalidates the pointer during device
lifetime. So we can remove related NULL checks.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sun, 17 May 2015 08:39:19 +0000 (10:39 +0200)]
core: pci: Skip architecture hooks on virtual device addition/removal
arch_pci_add_device and arch_pci_remove_device acted as nops for virtual
PCI devices so far, and there is no change in sight. So stop calling the
hooks from pci_add/remove_virtual_device, drop related checks from the
vtd code and rename functions that work on physical devices to clarify
their scope.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Fri, 15 May 2015 07:57:41 +0000 (09:57 +0200)]
driver: Prevent disabling when there are offlined CPUs
If Linux has offlined some of the CPUs itself, i.e. not for passing them
to other cells, and we disable the hypervisor then, those CPUs will not
be released. Attempts to online them again later on will fail. Reject
disable requests in such a case.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Thu, 14 May 2015 14:26:52 +0000 (16:26 +0200)]
inmates: x86: Add basic SMP support
Under Jailhouse, all the cell CPUs are started in parallel. To enable
SMP inmates, the entry code records their number and their APIC IDs (up
to the current limit of 255). Only the first CPU arriving at the entry
check will call inmate_main, the others are parked in halt state.
Inmates can use the recorded parameters to pick up all CPUs by sending
them regular INIT/SIPI signals. We use the entry path for this case as
well: ap_entry is introduced as an alternative entry function pointer.
If it is non-NULL, the CPU will bypass the SMP startup procedure and
call that function.
The library is extended to provide a boot-up barrier and a single-CPU
wakeup service. It also adds a simple IPI service.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Fri, 15 May 2015 06:20:35 +0000 (08:20 +0200)]
x86: Report number of CPUs via communication region
Append a field to the x86-specific part of the communication region to
inform non-root cells about the number of CPUs they can expect to show
up during boot.
We can generalize this when ARM has a need as well, but it's more likely
that it will use device trees instead (which are underdeveloped on x86).
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sat, 9 May 2015 14:57:08 +0000 (16:57 +0200)]
inmates: Build library archive and link it implicitly
Kbuild already comes with support for building lib.a archives from a set
of objects. Use this to build inmate libraries for x86, here in 64 and
32-bit form, and for ARM. Link against the correct libraries implicitly
so that the demos no longer have to state their dependencies explicitly.
This will also allow using the inmate libraries from different folders
than demos because the library objects are now only built once.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sat, 9 May 2015 14:51:34 +0000 (16:51 +0200)]
inmates: Define entry point in linker script
Export the reset address as symbols and define them as entry point of
our inmates in the linker scripts. We will bundle the headers together
with the other library objects in archives, and defining entry points
will ensure that the related sections will be included in the final
binary. This will simplify the inmate rules significantly.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Fri, 15 May 2015 06:09:40 +0000 (08:09 +0200)]
x86: Fix and clean up APIC ICR write handling
apic_handle_icr_write so far expects the hi_val in the format that
corresponds to the APIC mode in use. Internally, it then normalizes it
into x2APIC-mode format. That's complicating the usage and actually
enabled the bug that x2apic_handle_write did not convert the
cell-provided value into the required format.
Simplify and fix things by changing the API of apic_handle_icr_write to
accept the destination only in x2APIC format. That's much easier because
both callers can hard-code the conversion (none or shift by 24 bits) as
they know the input format.
The only side effect is that apic_send_ipi will now report errors with
ICR.hi always in x2APIC format, independent of the delivery path.
Probably even an advantage.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
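The two hard-coded conversions mentioned in the commit above ("none or
shift by 24 bits") can be illustrated with a short Python sketch. The
function names are hypothetical; the only facts relied on are that the
xAPIC keeps the destination in bits 31:24 of the high ICR dword, while
x2APIC uses the full 32-bit value.

```python
def dest_from_xapic_icr_hi(hi_val):
    """xAPIC: destination sits in bits 31:24, so shift it down."""
    return (hi_val >> 24) & 0xFF


def dest_from_x2apic_icr_hi(hi_val):
    """x2APIC: the high dword already is the destination ID."""
    return hi_val & 0xFFFFFFFF


def handle_icr_write(lo_val, dest):
    """Stand-in for the common handler: it only ever sees `dest`
    in x2APIC format, regardless of which path delivered the write."""
    return {"vector": lo_val & 0xFF, "dest": dest}
```

Because each caller knows its own input format, the common handler does
not need to inspect the APIC mode at all.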
Jan Kiszka [Fri, 15 May 2015 05:57:29 +0000 (07:57 +0200)]
x86: Flush pending events when reprogramming the VT-d error interrupt
There seems to be the risk of in-flight error events still using the
address and data registers while we reprogram them. In practice, this
shouldn't happen on a correctly configured system because all valid
interrupt sources are silenced at this point. Nevertheless, play safe,
just like Linux does.
However, there is no reason to also read back after unmasking (like
Linux does) because the hardware injects pending events when the mask is
cleared.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Wed, 13 May 2015 17:22:24 +0000 (19:22 +0200)]
tools: config-create: Fix IOMMU unit number of IOAPICs
IOAPICs under the control of IOMMUs with unit number >= 1 were not
described correctly in the generated configs due to a stupid naming
mistake that Python cannot report.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sat, 9 May 2015 06:00:41 +0000 (08:00 +0200)]
arm: Clean up hypervisor stage 1 memory attributes
Of the many attributes defined, some probably wrong, only 3 are actually
used: normal memory, device and non-cacheable. Validate those and drop
the rest. We can re-add more as needed.
See ARM ARM B4.1.104.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sat, 9 May 2015 05:54:53 +0000 (07:54 +0200)]
arm: Fix stage 2 memory attributes
The definition of memory attributes for stage 2 translations was wrong.
These attributes consist only of 4 bits, but the defines covered 8. Set
the proper values for those two types we use: normal memory and devices.
See ARM ARM B3.6.2 and B3.8.5 for details.
This fixes the enforcement of read-only or write-only cell memory
regions.
Reported-and-tested-by: Philipp Rosenberger <ilu@linutronix.de>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Thu, 7 May 2015 17:27:12 +0000 (19:27 +0200)]
core: Disable non-root PCI devices on shutdown
We already disable PCI devices that are removed when a cell is
destroyed but we should also do this on hypervisor shutdown to prevent
those devices from annoying Linux later on with unexpected activity.
The change is bigger as it re-indents the shutdown loop to maintain
readability.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Thu, 7 May 2015 17:10:20 +0000 (19:10 +0200)]
core: Do not program MSI-X vectors that are masked
Test for both function-level and vector-level masking before updating a
MSI-X interrupt mapping. Otherwise, we risk letting cells stumble over
stale but masked vector entries.
All accesses to a vector table entry now cause a mapping update. The
vector control dword is always cached to simplify testing it.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
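The double mask test described in the commit above can be sketched as
follows. This Python model uses the bit positions from the PCI
specification (Function Mask is bit 14 of the MSI-X Message Control
word, the per-vector Mask bit is bit 0 of the vector control dword); the
function and constant names are hypothetical and not taken from the
Jailhouse sources.

```python
MSIX_CTRL_FMASK = 1 << 14        # Function Mask in Message Control
MSIX_VECTOR_CTRL_MASK = 1 << 0   # Mask bit in the vector control dword


def should_program_vector(message_control, cached_vector_control):
    """Return True only if neither masking level is active."""
    function_masked = bool(message_control & MSIX_CTRL_FMASK)
    vector_masked = bool(cached_vector_control & MSIX_VECTOR_CTRL_MASK)
    return not (function_masked or vector_masked)
```

Keeping the vector control dword cached, as the commit does, means this
check needs no further device access.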
Jan Kiszka [Thu, 7 May 2015 16:14:53 +0000 (18:14 +0200)]
x86: Fix vtd int-remap region release
Tiny mistake, but it had the effect of only releasing the first MSI or
MSI-X vector of a PCI device on removal. The succeeding ones remained
both active for vtd and occupied for vtd_reserve_int_remap_region.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Wed, 6 May 2015 07:12:05 +0000 (09:12 +0200)]
x86: Reject xAPIC accesses while in x2APIC mode
If the APIC is in x2APIC mode, accesses via MMIO do not work (the APIC
behaves as if disabled). If Jailhouse executes them, it can be tricked
into accessing x2APIC registers that are invalid, causing a
hypervisor-side #GP.
Prevent this by bailing out early.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
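The early bail-out described in the commit above can be modeled with a
small Python sketch. It assumes the mode is derived from the
IA32_APIC_BASE value, where bit 10 is the architectural x2APIC enable
bit; the function name and the string results are hypothetical stand-ins
for the hypervisor's real control flow.

```python
APIC_BASE_EXTD = 1 << 10   # x2APIC mode enabled
APIC_BASE_EN = 1 << 11     # APIC globally enabled


def handle_xapic_mmio_access(apic_base, reg_offset):
    """Reject MMIO accesses when the APIC is in x2APIC mode."""
    if apic_base & APIC_BASE_EXTD:
        return "rejected"            # bail out before touching registers
    return "forward 0x%03x" % reg_offset
```

The check comes first so that no register access is attempted on behalf
of a cell that is using the wrong interface for the current mode.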
Jan Kiszka [Wed, 6 May 2015 05:43:47 +0000 (07:43 +0200)]
inmates: x86: Enable MTRRs during start to avoid disabled caches
Since fe8fac80d7, emulation of the MTRR enable bit works. That has no
effect on KVM so far, but we effectively run with the handbrake applied
on real hardware.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Fri, 1 May 2015 13:04:27 +0000 (15:04 +0200)]
x86: Allow access to Focus Processor Checking bit in APIC SVR
The Intel manual says: "In Pentium 4 and Intel Xeon processors, this bit
is reserved and should be cleared to 0." It apparently refers to the
first Xeon series here, not newer ones that support IA32e. Linux has set
this bit on x86-64 unconditionally for more than a decade. There are no
availability restrictions mentioned for AMD at all.
So let's release this bit to the cells because it cannot cause any harm
to the system or the hypervisor.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sat, 2 May 2015 10:34:39 +0000 (12:34 +0200)]
x86: Hand over the APIC in soft-disabled state
This brings the Spurious-Interrupt Vector Register into its well-defined
reset state before handing the APIC over. Avoids surprises for cells and
the need for additional explanations.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Mon, 4 May 2015 17:38:37 +0000 (19:38 +0200)]
arm: Remove ancient compiler bug test via __asmeq
This macro was once copied in from the Linux kernel. There it tries to
catch buggy gcc 3.x versions that didn't follow the specified register
assignments (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=15089).
This bug is now 10 years old, fixed, and affected compilers that weren't
even aware of the virt extensions for ARMv7 that we depend on anyway. So
let's remove it.
This also removes a GPL'ed line of code, thus enables a dual-licensing
of the file.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Fri, 20 Feb 2015 08:46:32 +0000 (09:46 +0100)]
core: Add BSD 2-Clause license to configuration format header
This avoids having to distribute configuration files for target systems
under GPL terms. It also allows processing those files with differently
licensed management tools.
Contributions came from Valentine, Henning and me. I'm signing off for
Henning as well in the name of Siemens.
Jan Kiszka [Sun, 5 Apr 2015 09:55:07 +0000 (11:55 +0200)]
x86: Do not call vmload/vmsave on every VM exit
Benchmarks indicate that we can gain about 160 cycles per VM exit &
reentry by only saving/restoring MSR_GS_BASE. We don't touch the other
states that vmload/vmsave deals with.
Specifically, we don't depend on a valid TR/TSS while in root mode
because Jailhouse neither has a userspace nor uses the IST for
interrupts or exceptions, thus does not try to access the TSS.
We still need to perform vmload on handover (actually, we only need to
load MSR_GS_BASE, but vmload is simpler) and after VCPU reset. And as we
no longer save the full state, also for shutdown, we need to pull the
missing information for arch_cpu_restore directly from the registers.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sun, 5 Apr 2015 08:52:32 +0000 (10:52 +0200)]
x86: Make FS_BASE MSR restoration VMX-specific
SVM does not touch this MSR on VM exit, thus does not require the
restoration done in arch_cpu_restore so far. Make it VMX-specific so
that we can drop a few lines of code.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sun, 5 Apr 2015 07:19:33 +0000 (09:19 +0200)]
x86: Make SYSENTER MSR restoration VMX-specific
SVM does not overwrite these MSRs on VM exit, thus does not require the
restoration done in arch_cpu_restore so far. Make them VMX-specific so
that we can drop a few lines of code.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Jan Kiszka [Sat, 4 Apr 2015 11:27:59 +0000 (13:27 +0200)]
x86: Refactor SVM version of vcpu_activate_vmm
We can reduce the assembly required in vcpu_activate_vmm by reordering
svm_vmexit to svm_vmentry, i.e. pulling the VM entry logic to the front.
Moreover, RAX can be loaded directly. There is furthermore no need to
declare clobbered variables as we won't return from the assembly block,
which is already declared via __builtin_unreachable.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>