Introducing Node Doctor

We are happy to announce the availability of Node Doctor, a new troubleshooting tool for OCI Container Engine for Kubernetes (OKE) worker nodes. Node Doctor now comes pre-installed on all OKE worker nodes and helps detect common node problems.

Node Doctor helps you troubleshoot common infrastructure-level issues with your OKE cluster worker nodes. For example, when the Node State is not “Active” or the Kubernetes Node Condition is not “Ready”, Node Doctor provides insight into the underlying problems so you can get your nodes back online. It can also be used to capture useful data to share with Oracle Support.

Node Doctor focuses on common issues at the intersection of Kubernetes and Oracle Cloud Infrastructure (OCI), the majority of which impact the health of Kubernetes worker nodes. Node Doctor runs a number of checks to ensure a worker node is operating as intended. For example, Node Doctor can indicate whether the number of pods on a node is too high and is causing issues in the kubelet, the primary node agent running on each worker node, or whether a node is running a known bad version of a dependency, such as runC, and should be recycled.

Using Node Doctor

To troubleshoot node issues, navigate to the node pool containing the problematic node and click Troubleshoot Nodes. This opens a dialog with multiple options for accessing nodes and running Node Doctor. We know our users follow different approaches to protecting their nodes in line with their security practices. With this in mind, we chose to support multiple paths to access nodes and run Node Doctor. Users with SSH access to their worker nodes can connect via SSH and run the command themselves. Users without SSH access can use an OCI Compute feature that lets users with the correct privileges run commands on a node without SSH. For more information about running commands on an OCI Compute host, see Running Commands on an Instance.
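For readers taking the SSH path, the step can be wrapped in a small helper. This is only a sketch: the function name run_node_doctor_check is made up for illustration, the default login user opc is an assumption about the node image, and the key path and node IP are placeholders you would supply yourself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a Node Doctor check on a worker node over SSH.
# Assumptions (not from the article): the login user defaults to "opc",
# and the caller supplies the node IP and the private key path.
run_node_doctor_check() {
  local node_ip="$1"
  local key_file="$2"
  local user="${3:-opc}"
  # -t allocates a terminal in case sudo needs one on the node.
  ssh -t -i "$key_file" "${user}@${node_ip}" \
    "sudo /usr/local/bin/node-doctor.sh --check"
}

# Example with placeholder values:
#   run_node_doctor_check 203.0.113.10 ~/.ssh/id_rsa
```

The helper only defines a function, so nothing runs until you call it with your own node address and key.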

Core Functionality

Node Doctor has two functions, checking for node issues and generating a support bundle:

Checking for Node Issues

sudo /usr/local/bin/node-doctor.sh --check

This command performs a handful of precondition checks to ensure the foundations of the worker node are in place. These include verifying that the kubelet is active, is running the correct version, and can access the Kubernetes API server. Once the preconditions are met, it checks for a variety of common issues and reports each check as either PASS or FAIL. The output is also saved to a log file in case you need the results later. If a check fails, Node Doctor prints remediation steps and links to documentation where applicable. For example, certain networking-related issues, including inactive proxymux certificates or an inaccessible kube-apiserver, return the following output:

Network related failures have been detected. Please validate the network settings. Common mistakes include not using a service gateway, incorrect security list rules, and specifying the wrong subnet.
https://docs.oracle.com/en-us/iaas/Content/ContEng/Concepts/contengnetworkconfig.htm
https://docs.oracle.com/en-us/iaas/Content/ContEng/Concepts/contengnetworkconfigexample.htm

This is what the Node Doctor --check command looks like when I run it on a healthy node:

$ sudo /usr/local/bin/node-doctor.sh -c
/usr/local/bin/oke-node-doctor does not exist.
Verified OK
chmod: cannot access ‘oke-node-doctor’: No such file or directory
INFO: Successfully downloaded node doctor.
Running node doctor...
PASS node health...
PASS DNS lookup...
PASS kubelet cert rotation flag...
PASS kubelet logs...
PASS service health...
PASS instance metadata...
PASS image and instance info...
PASS yum status...
PASS flannel status...
PASS coredns status...
PASS proxymux-client status...
PASS kube-proxy status...
PASS pods in ImagePullBackOff...
PASS pods failed mounting volume...
PASS runc version...
PASS pod usage...

NODE DOCTOR REPORT
------------------
16/16 checks passed
0 Signal(s) generated

Node doctor scan is complete. Report has been saved at /var/log/oke-node-doctor/oke-node-doctor-814.log
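If you want to review a saved report later, a short script can pull out the failures. The following is a minimal sketch, assuming the log file uses the same one-line-per-check format as the console output above, with each line starting with PASS or FAIL. The function name summarize_node_doctor_report is hypothetical and is not part of Node Doctor.

```shell
#!/usr/bin/env bash
# Sketch: summarize a saved Node Doctor report.
# Assumption: each check is logged on its own line starting with
# "PASS " or "FAIL ", matching the console output shown in this post.
summarize_node_doctor_report() {
  local report="$1"
  local passed failed
  # grep -c prints the number of matching lines (0 when none match).
  passed=$(grep -c '^PASS ' "$report")
  failed=$(grep -c '^FAIL ' "$report")
  echo "passed=${passed} failed=${failed}"
  # Print the failing checks, if any.
  grep '^FAIL ' "$report"
  # Return non-zero when at least one check failed.
  [ "$failed" -eq 0 ]
}

# Example, using the report path from the run above
# (reading it may require sudo, depending on file permissions):
#   summarize_node_doctor_report /var/log/oke-node-doctor/oke-node-doctor-814.log
```

Because the function returns a non-zero status when any check failed, it can be used in a conditional to decide whether a node needs attention.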

Generating a Support Bundle

sudo /usr/local/bin/node-doctor.sh --generate

This command performs the same checks as the --check command above and also generates a support bundle, a .tar file containing diagnostic information that can be shared with Oracle Support. My Oracle Support (MOS) provides information about how to upload the .tar file to a support ticket.

This is what the Node Doctor --generate command looks like when I run it on a healthy node:

$ sudo /usr/local/bin/node-doctor.sh -g
INFO: /usr/local/bin/oke-node-doctor already exists and MD5 match.
Running node doctor...
PASS node health...
PASS DNS lookup...
PASS kubelet cert rotation flag...
PASS kubelet logs...
PASS service health...
PASS instance metadata...
PASS image and instance info...
PASS yum status...
PASS flannel status...
PASS coredns status...
PASS proxymux-client status...
PASS kube-proxy status...
PASS pods in ImagePullBackOff...
PASS pods failed mounting volume...
PASS runc version...
PASS pod usage...

NODE DOCTOR REPORT
------------------
16/16 checks passed
0 Signal(s) generated

Node doctor scan is complete. Report has been saved at /var/log/oke-node-doctor/oke-node-doctor-2127.log
Generating node doctor bundle...
Generated /tmp/oke-support-bundle-2021-07-12T18-01-12.tar
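Before sharing a bundle, you may want to see what it contains. Here is a small sketch, assuming the bundle is an ordinary tar archive (as the .tar suffix suggests) and that bundles follow the oke-support-bundle-*.tar naming seen in the output above. Both function names are hypothetical helpers, not Node Doctor commands.

```shell
#!/usr/bin/env bash
# Sketch: inspect a support bundle before uploading it.
# Assumption: the bundle is a plain tar archive.
list_bundle_contents() {
  local bundle="$1"
  # -t lists the archive members without extracting them.
  tar -tf "$bundle"
}

# Hypothetical helper: print the newest bundle in a directory.
# Defaults to /tmp, where the run above wrote its bundle.
latest_bundle() {
  local dir="${1:-/tmp}"
  ls -t "$dir"/oke-support-bundle-*.tar 2>/dev/null | head -n 1
}

# Example:
#   list_bundle_contents "$(latest_bundle)"
```

Listing the archive leaves it untouched, so the same file can still be attached to a support ticket afterwards.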

Other Helpful Tools: Node Pool Work Requests

We recently revisited the way we expose work requests for node pool and control plane CRUD operations and added detailed information for each request, including log messages, error messages, and associated resources. This provides another source of helpful diagnostics in addition to the information made available by Node Doctor. Work requests can be accessed from the console, SDK, CLI, API, and other surfaces. For more information, see Viewing Work Requests.

Future Plans

We will offer additional paths to access nodes in the future, such as enabling access for users with restricted node access. Node Doctor will continue to be enhanced over time to cover additional issues, symptoms, and solutions as we uncover them.