hello. I'm running, or attempting to run, FreeBSD-12.1 with Xen-4.12
and HVM machines, using ZFS zvols as the backing store for the virtual
disks. I'm running into an issue where, after I write to a disk from
within the VM, filesystems on that disk become corrupted, including the
disk label and partition table. zpool status shows no errors, even
after I perform a scrub. ZFS might be a coincidence and it may be a
problem with Xen or Qemu, which is why I'm writing here.

To do my install, I configured two disks for my VM: one to use as a
boot/root disk and the other to build the system I'm virtualizing.
Interestingly enough, I do not see corruption on the virtual boot disk,
only on the secondary disk. My config is shown below. The VM is a
NetBSD-5.2 system. I'm running many 5.2 systems, both on real hardware
and as virtual machines, without trouble, so the problem seems to lie
with this setup rather than with the OS in the VM. The dmesg for the
VM is shown below as well. Are secondary disks not supported as
read-write disks?

-thanks
-Brian

Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004,
2005, 2006, 2007, 2008, 2009, 2010 The NetBSD Foundation, Inc. All
rights reserved.
Copyright (c) 1982, 1986, 1989, 1991, 1993 The Regents of the
University of California. All rights reserved.

NetBSD 5.2_STABLE (ALTQ) #0: Fri Feb 8 01:00:12 PST 2019 [hidden email]:/usr/src/sys/arch/i386/compile/ALTQ
total memory = 3839 MB
avail memory = 3761 MB
timecounter: Timecounters tick every 1.000 msec
timecounter: Timecounter "i8254" frequency 1193182 Hz quality 100
Xen HVM domU (4.12.1)
PCI BIOS rev. 2.1 found at 0xfd1c8
pcibios: config mechanism [1][x], special cycles [x][x], last bus 0
PCI IRQ Routing Table rev. 1.0 found at 0xf5d20, size 128 bytes (6 entries)
PCI Interrupt Router at 000:01:0 (Intel 82371FB (PIIX) PCI-ISA Bridge compatible)
mainbus0 (root)
cpu0 at mainbus0 apid 0: Intel 686-class, 3093MHz, id 0x206a7
cpu1 at mainbus0 apid 2: Intel 686-class, 3093MHz, id 0x206a7
ioapic0 at mainbus0 apid 1: pa 0xfec00000, version 11, 48 pins
acpi0 at mainbus0: Intel ACPICA 20080321
acpi0: X/RSDT: OemId < Xen, HVM,00000000>, AslId <HVML,00000000>
acpi0: SCI interrupting at int 9
acpi0: fixed-feature power button present
acpi0: fixed-feature sleep button present
timecounter: Timecounter "ACPI-Safe" frequency 3579545 Hz quality 900
ACPI-Safe 32-bit timer
hpet0 at acpi0 (HPET, PNP0103-0): mem 0xfed00000-0xfed003ff
timecounter: Timecounter "hpet0" frequency 62500000 Hz quality 2000
attimer1 at acpi0 (TMR, PNP0100): io 0x40-0x43 irq 0
pcppi1 at acpi0 (SPKR, PNP0800): io 0x61
midi0 at pcppi1: PC speaker (CPU-intensive output)
spkr0 at pcppi1
sysbeep0 at pcppi1
pckbc1 at acpi0 (PS2M, PNP0F13) (aux port): irq 12
pckbc2 at acpi0 (PS2K, PNP0303) (kbd port): io 0x60,0x64 irq 1
FDC0 (PNP0700) [PC standard floppy disk controller] at acpi0 not configured
UAR1 (PNP0501) [16550A-compatible COM port] at acpi0 not configured
apm0 at acpi0: Power Management spec V1.2
attimer1: attached to pcppi1
pckbd0 at pckbc2 (kbd slot)
pckbc2: using irq 1 for kbd slot
wskbd0 at pckbd0 mux 1
pms0 at pckbc2 (aux slot)
pckbc2: using irq 12 for aux slot
wsmouse0 at pms0 mux 0
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
pchb0 at pci0 dev 0 function 0
pchb0: Intel 82441FX (PMC) PCI and Memory Controller (rev. 0x02)
pcib0 at pci0 dev 1 function 0
pcib0: Intel 82371SB (PIIX3) PCI-ISA Bridge (rev. 0x00)
piixide0 at pci0 dev 1 function 1
piixide0: Intel 82371SB IDE Interface (PIIX3) (rev. 0x00)
piixide0: bus-master DMA support present
piixide0: primary channel wired to compatibility mode
piixide0: primary channel interrupting at ioapic0 pin 14
atabus0 at piixide0 channel 0
piixide0: secondary channel wired to compatibility mode
piixide0: secondary channel interrupting at ioapic0 pin 15
atabus1 at piixide0 channel 1
piixpm0 at pci0 dev 1 function 3
piixpm0: Intel 82371AB (PIIX4) Power Management Controller (rev. 0x03)
timecounter: Timecounter "piixpm0" frequency 3579545 Hz quality 1000
piixpm0: 24-bit timer
piixpm0: SMBus disabled
XenSource, Inc. Xen Platform Device (undefined subclass 0x80, revision 0x01) at pci0 dev 2 function 0 not configured
vga1 at pci0 dev 3 function 0: Cirrus Logic CL-GD5446 (rev. 0x00)
wsdisplay0 at vga1 kbdmux 1
wsmux1: connecting to wsdisplay0
wskbd0: connecting to wsdisplay0
drm at vga1 not configured
wm0 at pci0 dev 4 function 0: Intel i82540EM 1000BASE-T Ethernet, rev. 3
wm0: interrupting at ioapic0 pin 32
wm0: 32-bit 33MHz PCI bus
wm0: 64 word (6 address bits) MicroWire EEPROM
wm0: Ethernet address 00:4e:46:42:42:ce
makphy0 at wm0 phy 1: Marvell 88E1011 Gigabit PHY, rev. 0
makphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
isa0 at pcib0
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com0: console
isapnp0 at isa0 port 0x279: ISA Plug 'n Play device support
npx0 at isa0 port 0xf0-0xff
npx0: reported by CPUID; using exception 16
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
isapnp0: no ISA Plug 'n Play devices found
timecounter: Timecounter "clockinterrupt" frequency 1000 Hz quality 0
timecounter: Timecounter "TSC" frequency 3093100120 Hz quality 3000
wd0 at atabus0 drive 0: <QEMU HARDDISK>
wd0: drive supports 16-sector PIO transfers, LBA48 addressing
wd0: 1045 GB, 2174281 cyl, 16 head, 63 sec, 512 bytes/sect x 2191675392 sectors
wd0: 32-bit data port
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd1 at atabus0 drive 1: <QEMU HARDDISK>
wd1: drive supports 16-sector PIO transfers, LBA48 addressing
wd1: 1045 GB, 2174281 cyl, 16 head, 63 sec, 512 bytes/sect x 2191675392 sectors
wd1: 32-bit data port
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd0(piixide0:0:0): using PIO mode 4, DMA mode 2 (using DMA)
wd1(piixide0:0:1): using PIO mode 4, DMA mode 2 (using DMA)
IPsec: Initialized Security Association Processing.
Kernelized RAIDframe activated
boot device: wd0
root on wd0a dumps on wd0b
piixide0:0:0: lost interrupt type: ata tc_bcount: 512 tc_skip: 0
root file system type: ffs
wsdisplay0: screen 1 added (80x25, vt100 emulation)
wsdisplay0: screen 2 added (80x25, vt100 emulation)
wsdisplay0: screen 3 added (80x25, vt100 emulation)
wsdisplay0: screen 4 added (80x25, vt100 emulation)

Configuration file here:

type = "hvm"

# Initial memory allocation (in megabytes) for the new domain.
memory = 4096

# Number of Virtual CPUS to use, default is 1
vcpus = 2

# A name for your domain. All domains must have different names.
name = "test_nbsd"

#----------------------------------------------------------------------------
# network configuration.
# The mac address is optional, it will use a random one if not specified.
# By default we create a bridged configuration; when a vif is created
# the script /usr/pkg/etc/xen/scripts/vif-bridge is called to connect
# the vif to the designated bridge (the bridge should already be up)
vif = [ 'type=ioemu,model=e1000,mac=00:4e:46:42:42:ce,bridge=bridge0' ]

#----------------------------------------------------------------------------
# Define the disk devices you want the domain to have access to, and
# what you want them accessible as.
# Each disk entry is of the form phy:UNAME,DEV,MODE
# where UNAME is the device, DEV is the device name the domain will see,
# and MODE is r for read-only, w for read-write.
# For NetBSD guests DEV doesn't matter, so we can just use increasing
# numbers here. For Linux guests you have to use a Linux device name
# (e.g. hda1) or the corresponding device number (e.g. 0x301 for hda1)
disk = [ '/dev/zvol/xendisks/nb_backup,raw,hda,rw',
         '/dev/zvol/xendisks/nb_root,raw,hdb,rw' ]

# Boot from the hard drive
boot = 'c'

# Turn off graphics
#sdl = 1

# Turn on the serial port as console
serial = 'pty'

#----------------------------------------------------------------------------
# Boot parameters (e.g. -s, -a, ...)
extra = ""

#============================================================================
# Reboot after shutdowns
#autorestart = True
on_poweroff = "restart"

_______________________________________________
[hidden email] mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-xen
To unsubscribe, send any mail to "[hidden email]"
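[For context: backing zvols like the two referenced in the disk stanza above are created with zfs create -V. A hypothetical sketch, in the same style as the commands quoted later in this thread; the sizes and options here are assumptions, not the poster's actual commands:

# zfs create -V 1045G xendisks/nb_backup
# zfs create -V 1045G xendisks/nb_root

Each zvol then appears under /dev/zvol/<pool>/<name> and can be handed to the guest via a phy: disk entry.]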
hello. Following up on my own message, I believe I've run into a
serious problem that exists on FreeBSD-Xen with FreeBSD-12.1p10 and
Xen-4.14.0.

Just in case I was running into an old bug with yesterday's post, I
updated to Xen-4.14.0 and Qemu-5.0.0. The problem was still there,
i.e. when writing to a second virtual hard drive on an HVM domU, the
drive becomes corrupted. Again, zpool scrub shows no errors.

So, I decided it might be some sort of memory error. I wrote a memory
test program, shown below, and ran it on my HVM domU. It not only
crashed the domU itself, it crashed the entire Xen server! There are
some dmesg messages that appeared before the Xen server crash, shown
below, which suggest a serious problem. In my view, no matter how
badly the HVM domU behaves, it shouldn't be able to crash the Xen
server itself! The domU is running NetBSD-5.2, an admittedly old
version of the operating system, but I'm running a fleet of these
machines, both on real hardware and on older versions of Xen, with no
stability issues whatsoever. And, as I say, I shouldn't be able to
wipe out the Xen server from an HVM domU, no matter what I do!

The memory test program takes one argument: the amount of RAM, in
megabytes, you want it to test. It allocates that memory, then
sequentially walks through it over and over again, writing to it and
reading from it, checking that the data read matches the data written.
This has the effect of causing the resident set size of the program to
grow slowly over time as it works. It was originally written to test
the paging efficiency of a system, but I modified it to actually test
the memory along the way.

To reproduce the issue, perform the following steps:

1. Set up an HVM guest; I think FreeBSD as an HVM domU will work fine.
   Use ZFS zvols as the backing store for the virtual disk(s) for your
   guest.

2. Compile this program for that guest and run it as follows:
	./testmem 1000
   This asks the program to allocate 1000 MB (about 1G) of memory and
   then walk through and test it. It will report each megabyte of
   memory it has written and tested.

My test HVM guest had 4G of RAM, as it was a 32-bit OS running in the
domU. Nothing else was running on either the Xen server or the domU.
I'm not sure exactly how far the program got in its memory walk before
things went south, but I think it touched about 100 megabytes of its
1000-megabyte allocation. My program was not running as root, so it
had no special privileges, even on the domU.

I'm not sure if the problem is with Qemu, Xen, or some combination of
the two. It would be great if someone could reproduce this issue and
maybe shed a bit more light on what's going on.

-thanks
-Brian

<error messages on xen server just before the crash!>

Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_ring2pkt:1534): Unknown extra info type 255. Discarding packet
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:304): netif_tx_request index =0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:305): netif_tx_request.gref =0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:306): netif_tx_request.offset=0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:307): netif_tx_request.flags =8
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:308): netif_tx_request.id =69
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:309): netif_tx_request.size =1000
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:304): netif_tx_request index =1
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:305): netif_tx_request.gref =255
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:306): netif_tx_request.offset=0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:307): netif_tx_request.flags =0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:308): netif_tx_request.id =0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:309): netif_tx_request.size =0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_rxpkt2rsp:2068): Got error -1 for hypervisor gnttab_copy status
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_ring2pkt:1534): Unknown extra info type 255. Discarding packet
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:304): netif_tx_request index =0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:305): netif_tx_request.gref =0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:306): netif_tx_request.offset=0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:307): netif_tx_request.flags =8
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:308): netif_tx_request.id =69
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:309): netif_tx_request.size =1000
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:304): netif_tx_request index =1
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:305): netif_tx_request.gref =255
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:306): netif_tx_request.offset=0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:307): netif_tx_request.flags =0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:308): netif_tx_request.id =0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:309): netif_tx_request.size =0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_rxpkt2rsp:2068): Got error -1 for hypervisor gnttab_copy status

<cut here for test program, testmem.c>

/**************************************************************************
 NAME: Brian Buhrow
 DATE: November 11, 2020
 PURPOSE: This program allocates the indicated number of megabytes of
 RAM, then touches each page to ensure it gets brought into core memory.
 In this way, the program exercises the OS's VM paging system.  It then
 checks for memory corruption by re-reading each segment that it writes.
 Terminate the program by hitting control-C.
**************************************************************************/
static char rcsid[] = "$Id: testmem.c,v 1.2 2020/11/11 08:06:27 buhrow Exp $";

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define TESTSTR "This is a test\n"

int
main(int argc, char **argv)
{
	int i, bufsiz, requested, testindex, testlen;
	char *buf, *ptr;
	char tmpbuf[1024];

	if (argc != 2) {
		printf("Usage: %s <size in megabytes>\n", argv[0]);
		return(0);
	}
	sscanf(argv[1], "%d", &requested);
	if (requested <= 0) {
		printf("%s: You must request more than 0 MB of RAM.\n", argv[0]);
		return(0);
	}
	bufsiz = requested * (1024 * 1024);
	printf("%s: Allocating %dMB of RAM (%u bytes)\n", argv[0], requested, bufsiz);
	buf = (char *)malloc(bufsiz);
	if (!buf) {
		snprintf(tmpbuf, sizeof(tmpbuf), "%s: Unable to allocate memory", argv[0]);
		perror(tmpbuf);
		exit(1);
	}
	printf("%s: Memory allocated, starting at address: %p\n", argv[0], buf);
	testindex = 65;
	for (;;) {
		/* Build the test pattern, cycling its trailing byte each pass. */
		memset(tmpbuf, 0, sizeof(tmpbuf));
		snprintf(tmpbuf, sizeof(tmpbuf), "%s%c\n", TESTSTR, testindex);
		testindex += 1;
		if (testindex > 126)
			testindex = 65;
		testlen = strlen(tmpbuf);
		/* Write pass: tile the pattern across the buffer. */
		for (i = 0; i + testlen <= bufsiz; i += testlen) {
			ptr = &buf[i];
			memcpy(ptr, tmpbuf, testlen);
			if ((i % (1024 * 1024)) <= 15) {
				printf("%u MB touched...\n", i / (1024 * 1024));
				sleep(5);
			}
		}
		/* Verify pass: re-read each slot and compare against the pattern. */
		for (i = 0; i + testlen <= bufsiz; i += testlen) {
			ptr = &buf[i];
			if (memcmp(tmpbuf, ptr, testlen) != 0)
				printf("Memory error near %p\n", ptr);
			if ((i % (1024 * 1024)) <= 15) {
				printf("%u MB checked...\n", i / (1024 * 1024));
				sleep(5);
			}
		}
	}
	/* not reached */
	exit(0);
}
On Wed, Nov 11, 2020 at 01:13:18AM -0800, Brian Buhrow wrote:
> hello. Following up on my own message, I believe I've run into a
> serious problem that exists on FreeBSD-Xen with FreeBSD-12.1p10 and
> Xen-4.14.0. Just in case I was running into an old bug with
> yesterday's post, I updated to Xen-4.14.0 and Qemu-5.0.0. The problem
> was still there, i.e. when writing to a second virtual hard drive on
> an HVM domU, the drive becomes corrupted. Again, zpool scrub shows no
> errors.

Are you using volmode=dev when creating the zvol?

# zfs create -V16G -o volmode=dev zroot/foo

This is required when using a zvol with bhyve, but it shouldn't be
required for Xen, since we lock the guest disks from the kernel so GEOM
cannot taste them.

> So, I decided it might be some sort of memory error. I wrote a memory
> test program, shown below, and ran it on my HVM domU. It not only
> crashed the domU itself, it crashed the entire Xen server! There are
> some dmesg messages that appeared before the Xen server crash, shown
> below, which suggest a serious problem. In my view, no matter how
> badly the HVM domU behaves, it shouldn't be able to crash the Xen
> server itself! The domU is running NetBSD-5.2, an admittedly old
> version of the operating system, but I'm running a fleet of these
> machines, both on real hardware and on older versions of Xen, with no
> stability issues whatsoever. And, as I say, I shouldn't be able to
> wipe out the Xen server from an HVM domU, no matter what I do!

Can you please paste the config file of the domain?

> [description of the memory test program and reproduction steps
> snipped; quoted in full in the previous message]
>
> <error messages on xen server just before the crash!>
>
> [xnb "Unknown extra info type 255" / netif_tx_request / gnttab_copy
> error log snipped; quoted in full in the previous message]

Do you have a serial line attached to the server, and if so are those
the last messages that you see before the server reboots? I would
expect some kind of panic from the FreeBSD dom0 kernel or Xen itself
before the server reboots.

Those error messages are actually from the PV network controller, so
I'm not sure they are related to the disk in any way. Are you doing
anything else when this happens?

Roger.
hello Roger. thanks for engaging with me on this issue. I think
I've made progress on the issue and have a better handle on what's going wrong. There seem to be a cascade of bugs here, which I'll try to enumerate. 1. The disk corruption issue seems to be a bug in qemu whereby the emulated IDE disk controller issues partial writes instead of full writes or no writes with appropriate failure to the disk. The IDE driver in NetBSD-5.2 doesn't play well with this behavior, in fact, NetBSD until May of 2020, doesn't play well with this behavior See: http://mail-index.NetBSD.org/source-changes/2020/05/24/msg117668.html 2. This causes memory corruption in the OS itself, which can trigger a xen server crash! (In my view, no matter how badly behaved the guest OS is, it shouldn't be able to bring down the xen server.) 3. Running NetBSD-5.2/i386 as a domu, which works flawlessly under xen3, gets a panic: HYPERVISOR_mmu_update failed I suspect this can be worked around using some of the command line options under Xen, xpti=true,domu=false, perhaps? Are there others I should consider? If I can get the domu kernel working, or can back port the patch listed in bug 1 above, I should be off to the races. Still, I think there's a serious issue here in bug 2, listed above that ought to be looked at. Unfortunately, I don't hav a way to readily reproduce it. Any thoughts on how to achieve pv32 backward compatibility with xen3 would be greatly appreciated. Pv64 NetBSD-5.2 seems to work fine. -thanks -Brian _______________________________________________ [hidden email] mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-xen To unsubscribe, send any mail to "[hidden email]" |
On Thu, Nov 12, 2020 at 09:18:59PM -0800, Brian Buhrow wrote:
> hello Roger. Thanks for engaging with me on this issue. I think I've
> made progress and have a better handle on what's going wrong. There
> seems to be a cascade of bugs here, which I'll try to enumerate.
>
> 1. The disk corruption issue seems to be a bug in Qemu whereby the
> emulated IDE disk controller issues partial writes, instead of either
> full writes or no writes with an appropriate failure status. The IDE
> driver in NetBSD-5.2 doesn't play well with this behavior; in fact,
> NetBSD until May of 2020 doesn't play well with this behavior. See:
> http://mail-index.NetBSD.org/source-changes/2020/05/24/msg117668.html

Oh great, so it's something specific to NetBSD. This means you are
running NetBSD in HVM mode?

> 2. This causes memory corruption in the guest OS itself, which can
> trigger a Xen server crash! (In my view, no matter how badly behaved
> the guest OS is, it shouldn't be able to bring down the Xen server.)

Right, we really need the trace from this crash. Do you have a serial
hooked up to the box so that you can provide the panic message?

> 3. Running NetBSD-5.2/i386 as a domU, which works flawlessly under
> Xen-3, gets a panic: HYPERVISOR_mmu_update failed

This would imply that you are running NetBSD in PV mode, in which case
it won't be using the emulated hard disk drive, and hence the commit
you referenced above would be unrelated. Can you assert whether you are
running NetBSD in PV or HVM mode?

> I suspect bug 3 can be worked around using some of Xen's command line
> options -- xpti=true,domu=false, perhaps? Are there others I should
> consider?

Maybe. I think you should report this to the xen-devel and NetBSD/Xen
mailing lists, with a full trace of the crash and the guest config
file.

Roger.
hello Roger. Sorry for the confusion. I've been running NetBSD, or
trying to, in both PV and HVM modes. Below is a message I sent to the
[hidden email] mailing list a few days ago, detailing what I found
when running NetBSD in any mode under Xen as a domU.

Unfortunately, I don't think I can provide a trace of a crash of the
Xen server when the HVM guest misbehaves, because I've had to move on
in my work and I found a working combination of NetBSD and ZFS that
lets me proceed. I've seen some discussion in the Xen documentation
suggesting that one can run a Xen dom0 inside a domU guest, allowing
sandboxed testing of what you're describing without having to devote
hardware to the issue. Am I correct in this, and where might I find
instructions on how to do it?

In any case, here's what I found.

-thanks
-Brian

From: Brian Buhrow <[hidden email]>
Date: Mon, 16 Nov 2020 14:59:48 -0800
Subject: Re: Panic: HYPERVISOR_mmu_update failed on old NetBSD-5.2 domU
Cc: [hidden email], [hidden email]

hello Greg. Thanks for the pointers on Xen changes over the years and
the reasons for the changes. After a lot more testing, thought, and cvs
diffing, here is the state of the world as I understand it, for the
record.

NetBSD-5/i386 won't run under Xen-4.12 or newer (possibly older
versions of Xen too; I didn't test) because it still has calls to
xpq_queue_flush and the like, which were used in pre-Xen-3 days. I
believe, through code inspection, but didn't test, that this is also
true of NetBSD-6/i386 and NetBSD-7/i386.

My next thought was to use NetBSD-5 in HVM mode, but that turns out to
be a show stopper for versions of NetBSD all the way through -current
as of May 24, 2020 (see the link below), due to a cascade of bugs in
the piixide(4) emulation in Qemu and NetBSD's inability to deal with
partial success when writing to disks on that chip set. This failure is
what causes the original disk corruption I wrote about at the beginning
of this thread.

As an alternative, I tried enabling the LSI SCSI emulator in Qemu, but
this doesn't work because our esiop(4) driver doesn't play well with
the emulated LSI chips in Qemu. No disk corruption, just timeouts when
writing to attached emulated sd(4) devices. So, NetBSD as an HVM guest
is pretty much a bust!

Fortunately, there is a solution: NetBSD-5.2/amd64 domUs work fine
under Xen-4.12 and above. So, I'll run the 64-bit amd64 base with a
32-bit set of pkgsrc packages, which lets me migrate my busy, working
machine to newer hardware as a VM. That gives me the ability to build a
replacement installation for this machine and begin building a new
environment without interrupting service on the old one. I have several
instances of 64-bit kernels running with a 64-bit base and 32-bit
pkgsrc packages running successfully, so while it's not ideal, it works
quite well and gives me space to think about how to move to NetBSD-9
and beyond.

Thanks again for the ideas and, hopefully, someone will find this
summary useful.

-Brian

http://mail-index.NetBSD.org/source-changes/2020/05/24/msg117668.html
On Wed, Nov 18, 2020 at 11:58:27AM -0800, Brian Buhrow wrote:
> hello Roger. Sorry for the confusion. I've been running NetBSD, or
> trying to, in both PV and HVM modes. Below is a message I sent to the
> [hidden email] mailing list a few days ago, detailing what I found
> when running NetBSD in any mode under Xen as a domU. Unfortunately, I
> don't think I can provide a trace of a crash of the Xen server when
> the HVM guest misbehaves [...]

OK, I'm not really able to reproduce this myself, so if you get those
crashes again can you please make sure you have a serial line attached
to the box so that you can provide some trace?

> [...] I've seen some discussion in the Xen documentation suggesting
> that one can run a Xen dom0 inside a domU guest, allowing sandboxed
> testing of what you're describing without having to devote hardware
> to the issue. Am I correct in this, and where might I find
> instructions on how to do it?

You can test a nested PV dom0, but not a nested PVH dom0, which is what
FreeBSD uses to run as dom0.

> In any case, here's what I found.

Thanks, I'm sure this will be helpful to others.

Roger.