
Here’s one for the future troubleshooting seekers. I was testing IPv4 UEFI PXE booting a Supermicro A1SAi motherboard after applying the Atom 2550 fix and couldn’t get the thing to load the network bootstrap program (NBP). I’m not saying this is the only cause of a PXE-E99 error; it’s just the one I hit today.

This blipped by on VGA console so fast I had to use slo-mo on my phone to capture it. (With the Atom 2550 fix, console redirection is lost, so no serial console scrollback).

Checking Media Presence.....
Media Present....
Start PXE over IPv4. Press ESC key to abort PXE boot.
Station IP address is 192.168.135.29

Server IP address 192.168.130.10
NBP filename is /efi64/syslinux.efi
NBP filesize is 0 Bytes
PXE-E99: Unexpected network error.

My PXE environment is pretty much set in stone; it’s configuration-managed and doesn’t get changed willy-nilly.

Things that came to mind:

  • Wrong EFI binary? Did this particular firmware want some weird 32-bit EFI program? Possible, but I had a nearly identical motherboard with an older firmware that loaded the x86-64 binary just fine. Plus I’m nearly 100% certain I’ve used this same binary on this same motherboard before it died.
  • TFTP server broken? No, I was able to fetch syslinux.efi on several other machines on the same LAN just fine.
  • UEFI doesn’t support routing to an off-network TFTP server? This seemed feasible, yet I’m absolutely certain I’ve PXE installed these boards on different LANs than the TFTP server.
  • Does the UEFI firmware not like it when two DHCP servers respond? Both of my DHCP servers run identical configuration files and reservations, but this was easy to test by stopping one temporarily. Didn’t help.

Tcpdumping on both the TFTP server and the router facing the LAN this motherboard was attached to showed there were absolutely no attempts made on the wire to fetch anything over TFTP.

The only thing left was DHCP. The “unexpected network error” got me digging into the DHCP responses being sent back:

01:35:21.075721 IP (tos 0x0, ttl 64, id 4266, offset 0, flags [DF], proto UDP (17), length 354)
    192.168.130.12.67 > 192.168.135.1.67: [bad udp cksum 0x8bbe -> 0xd3ef!] BOOTP/DHCP, Reply, length 326, hops 1, xid 0xafb187ee, Flags [Broadcast] (0x8000)
          Your-IP 192.168.135.29
          Server-IP 192.168.130.10
          Gateway-IP 192.168.135.1
          Client-Ethernet-Address 0c:c4:7a:32:27:e0
          file "/efi64/syslinux.efi"
          Vendor-rfc1048 Extensions
            Magic Cookie 0x63825363
            DHCP-Message Option 53, length 1: ACK
            Server-ID Option 54, length 4: 192.168.130.12
            Lease-Time Option 51, length 4: 86400
            Subnet-Mask Option 1, length 4: 255.255.255.0
            Default-Gateway Option 3, length 4: 192.168.130.1             <<<<<<<<<<<<<<<<<<<<<<
            Domain-Name-Server Option 6, length 12: 192.168.135.1,192.168.130.10,192.168.130.12
            Hostname Option 12, length 16: "basic10.wann.net"
            Domain-Name Option 15, length 8: "wann.net"
            BR Option 28, length 4: 192.168.130.255                       <<<<<<<<<<<<<<<<<<<<<<
            NTP Option 42, length 8: 192.168.130.10,192.168.130.12

After a while, what stood out to me was that my DHCP server was returning a Default-Gateway of 192.168.130.1 in the RFC1497 Options section, which is on a different subnet from the Gateway-IP/GIADDR (192.168.135.1) set earlier in the packet. The broadcast address in the Options (192.168.130.255) was from the wrong subnet too.
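The check I eventually did by eye can be sketched in a few lines of Python. This is only an illustration, not something from my actual tooling: the function name check_lease is made up, and the addresses are copied from the DHCP ACK above.

```python
import ipaddress

def check_lease(your_ip, netmask, router, broadcast):
    """Return a list of problems found in a DHCP lease (hypothetical helper)."""
    # Work out the network the client was actually placed on.
    net = ipaddress.ip_network(f"{your_ip}/{netmask}", strict=False)
    problems = []
    # The default gateway has to be directly reachable, i.e. on the same subnet.
    if ipaddress.ip_address(router) not in net:
        problems.append(f"router {router} is not on {net}")
    # The broadcast option should match the broadcast address of that subnet.
    if ipaddress.ip_address(broadcast) != net.broadcast_address:
        problems.append(f"broadcast {broadcast} != {net.broadcast_address}")
    return problems

# Values from the captured ACK: Your-IP, Option 1, Option 3, Option 28.
for problem in check_lease("192.168.135.29", "255.255.255.0",
                           "192.168.130.1", "192.168.130.255"):
    print(problem)
```

With the values from the capture, both the router and the broadcast address get flagged, which matches the two lines I marked with arrows.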

It would seem that I have a misconfiguration in my ISC DHCP server somewhere. That’s what’s causing the wrong gateway to be returned, and makes sense in that the UEFI loader gets an address but can’t reach anything off the local subnet. Apparently all the previous times I’ve PXE booted these systems I’ve always used IPv6 and never hit this problem until I tested with IPv4. I hacked on my config to temporarily set the gateway manually to what it should be for the test host, and it PXE booted off the network just fine, using the TFTP server on the other LAN.

As to how my DHCP server configuration is wrong, I haven’t figured it out yet. I never put time into understanding how classes and groups are supposed to work. My config looked something like this:

option ntp-servers 192.168.130.10, 192.168.130.12;
option domain-name-servers 192.168.130.10, 192.168.130.12;

subnet 192.168.130.0 netmask 255.255.255.0 {
  next-server 192.168.130.1;
  option routers 192.168.130.1;

  host a {
     hardware ethernet ...
     fixed-address ...
  }
  host b {
    ...
  }
}

subnet 192.168.135.0 netmask 255.255.255.0 {
  next-server 192.168.135.1;
  option routers 192.168.130.1;

  host c {
     hardware ethernet ...
     fixed-address ...
  }
  host d {
    ...
  }
}

And in this example, for whatever reason, booting host “d” would get the router from the other subnet. Every subnet example I’ve seen shows putting “option routers” in a subnet scope. I do have some groups and classes in there that are clearly fouling things up, but I don’t see how.

From what I’ve been reading, class and host are top-level scopes and shouldn’t go inside subnet.
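To find every place where I had done this, something like the following rough Python sketch could scan a config for host declarations nested inside a subnet block. It is an assumption-laden illustration, not a real dhcpd.conf parser: the function name nested_hosts is made up, it only tracks braces, and it does not understand comments or quoted strings containing braces.

```python
import re

def nested_hosts(conf_text):
    """Return names of host declarations that sit inside a subnet block."""
    stack = []    # kinds of the blocks currently open, outermost first
    found = []
    pending = ""  # text seen since the last brace
    # Split on braces but keep them as separate tokens.
    for token in re.split(r"([{}])", conf_text):
        if token == "{":
            # The declaration that opens this block follows the last ';'.
            words = pending.split(";")[-1].split()
            kind = words[0] if words else ""
            if kind == "host" and "subnet" in stack:
                found.append(words[1] if len(words) > 1 else "?")
            stack.append(kind)
            pending = ""
        elif token == "}":
            if stack:
                stack.pop()
            pending = ""
        else:
            pending = token
    return found

sample = """
subnet 192.168.130.0 netmask 255.255.255.0 {
  option routers 192.168.130.1;
  host a { }
  host b { }
}
host c { }
"""
print(nested_hosts(sample))
```

For the sample, only hosts a and b are reported, since host c is already at the top level.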

So I just re-wrote my DHCP configuration into what I believe now are the right scopes:

...
option ntp-servers 192.168.130.10, 192.168.130.12;
option domain-name-servers 192.168.130.10, 192.168.130.12;
next-server 192.168.130.10;

# For things that match this class, override global options
class "pxeclients" {
  match if substring (option vendor-class-identifier, 0, 9) = "PXEClient";

  if option arch = 00:07 {
    filename "/efi64/syslinux.efi";
  } else {
    # PXELINUX >= 5.X is the new hotness with HTTP/FTP
    filename "/bios/lpxelinux.0";
  }
}

subnet 192.168.130.0 netmask 255.255.255.0 {
  option routers 192.168.130.1;
  include "/etc/dhcp/homenet-130.inc";
}

subnet 192.168.135.0 netmask 255.255.255.0 {
  option domain-name-servers 192.168.135.1,192.168.130.10,192.168.130.12;
  option routers 192.168.135.1;
}

subnet 192.168.136.0 netmask 255.255.255.0 {
  option routers 192.168.136.1;
  include "/etc/dhcp/homenet-136.inc";
}

group homenet {
  host a {
    fixed-address 192.168.130.x;
    ...
  }
  host b {
    fixed-address 192.168.130.x;
    ...
  }
}

group otherstuff {
  host c {
    fixed-address 192.168.135.x;
  }
  host d {
    fixed-address 192.168.135.x;
  }
}

Instead of putting a “host” inside of the “subnet” scope, I put them all at the top level. Apparently dhcpd just “knows” that a host belongs to a subnet based upon the fixed-address matching the subnet+mask given in the subnet declaration, instead of trying to use the “subnet” scope to organize “host” entries.
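The idea, as I understand it, can be sketched in Python like this. It is not dhcpd’s actual lookup code, just an illustration of matching a fixed-address against the subnet declarations; subnet_for is a made-up name, and the test address is the one my board was handed in the capture earlier.

```python
import ipaddress

# The three subnet declarations from the config, written as prefixes.
SUBNETS = [
    ipaddress.ip_network("192.168.130.0/24"),
    ipaddress.ip_network("192.168.135.0/24"),
    ipaddress.ip_network("192.168.136.0/24"),
]

def subnet_for(fixed_address):
    """Return the declared subnet containing fixed_address, or None."""
    addr = ipaddress.ip_address(fixed_address)
    for net in SUBNETS:
        if addr in net:
            return net
    return None

# A top-level host with this fixed-address lands in the 135 subnet,
# so it would pick up that subnet's "option routers".
print(subnet_for("192.168.135.29"))
```

The point is that the host entry does not need to live inside the subnet block for the subnet’s options to apply to it.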

After doing this and factoring out some duplicate classes that set the filename, UEFI PXE booting my test system worked on the first try! I’m past the point of caring enough to bisect my original config; this was a detour from doing other things. Besides, I should be burning all of this down and finishing my migration to the Kea DHCP server.

Supermicro / Intel Atom 2550 fix

I’ve loved using the A1SRi motherboards in ikeacluster for years; they offered a lot of RAM power with a fan-less embedded CPU. I’ve run them for several years and had a couple eventually succumb to the Atom C2000 clock failure problems. The BMC would work but the motherboard would just decide one day to stop booting.

For years I wrote these off as dead and, I thought, beyond the period in which people were getting Supermicro to replace them. I was sad there was not a great replacement. I seem to recall the Denverton mini-ITX boards were either late or considerably more expensive. There were some other A1SRi boards on eBay, but they were expensive and who knows when they’d die.

A while back I had read on some forums that people were working on a fix to revive these motherboards. I finally got around to reading up on the Serve The Home thread and a TrueNAS thread to try it out. On my boards I ran a 200 ohm jumper between pins 1 and 9 on the TPM header and that seemed to do the trick:

TPM header jumper

There is still one problem: the BMC is alive and I can control VGA+keyboard input, but ipmitool on Linux can’t query the BMC and serial redirection doesn’t work. But I’m happy I was able to revive two of my motherboards! It also looks like even as late as 2023 people were getting these boards replaced by Supermicro, so I might have to give that a try.

Update 27-Oct-2024:

Huh, I thought I had read this fix breaks connectivity between the OS and the BMC, but serial-over-LAN and things like ipmitool lan print work:

ipmitool sol activate after resistor fix

ipmitool viewing BMC sensors and network settings

I’ll take it!
