•  


fallback_ips: Gather facts from reachable hosts only by Rickkwa · Pull Request #11006 · kubernetes-sigs/kubespray · GitHub
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement . We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fallback_ips: Gather facts from reachable hosts only #11006

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

Rickkwa
Copy link
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it :

When working with a large fleet of nodes, there is inevitably going to be some unreachable nodes. The linked issue describes that when kubespray-defaults runs and there exists an unreachable node, then it causes the play to unexpectedly exit early.

This is a fix to fallback_ips.yml to let it finish the rest of the role when unreachable node(s) exist.

Which issue(s) this PR fixes :

Fixes #10993

Special notes for your reviewer :

Local setup (inventory and test playbook):

details Inventory
[k8s_cluster:children]

kube_control_plane
kube_node
calico_rr

[kube_control_plane]

k8s1.local

[etcd]

k8s1.local

[kube_node]

k8s3.local  
#
 unreachable

k8s2.local

[calico_rr]

Playbook

---
- 
name
: 
Prepare nodes for upgrade

  hosts
: 
k8s_cluster:etcd:calico_rr

  gather_facts
: 
False

  any_errors_fatal
: 
"
{{ any_errors_fatal | default(true) }}
"

  environment
: 
"
{{ proxy_disable_env }}
"

  roles
:
    - 
{ role: kubespray-defaults }


  post_tasks
:
    - 
name
: 
Task that runs after

      debug
:
        
var
: 
fallback_ips

Showing the original output before applying my PR:

details
PLAY [Prepare nodes for upgrade] ********************************************************************************************************************************************************************************

TASK [kubespray-defaults : Gather ansible_default_ipv4 from all hosts] ******************************************************************************************************************************************
ok: [k8s1.local] => (item=k8s1.local)
[WARNING]: Unhandled error in Python interpreter discovery for host k8s1.local: Failed to connect to the host via ssh: ssh: connect to host k8s3.local port 22: Connection timed out
failed: [k8s1.local -> k8s3.local] (item=k8s3.local) => {"ansible_loop_var": "item", "item": "k8s3.local", "msg": "Data could not be sent to remote host \"k8s3.local\". Make sure this host can be reached over ssh: ssh: connect to host k8s3.local port 22: Connection timed out\r\n", "unreachable": true}
ok: [k8s1.local -> k8s2.local] => (item=k8s2.local)
fatal: [k8s1.local -> {{ item }}]: UNREACHABLE! => {"changed": false, "msg": "All items completed", "results": [{"ansible_facts": {"ansible_default_ipv4": {"address": "10.88.111.29", "alias": "eth0", "broadcast": "10.88.111.255", "gateway": "10.88.111.254", "interface": "eth0", "macaddress": "bc:24:11:41:88:12", "mtu": 1500, "netmask": "255.255.252.0", "network": "10.88.108.0", "prefix": "22", "type": "ether"}, "discovered_interpreter_python": "/usr/bin/python3"}, "ansible_loop_var": "item", "changed": false, "failed": false, "invocation": {"module_args": {"fact_path": "/etc/ansible/facts.d", "filter": ["ansible_default_ipv4"], "gather_subset": ["!all", "network"], "gather_timeout": 10}}, "item": "k8s1.local"}, {"ansible_loop_var": "item", "item": "k8s3.local", "msg": "Data could not be sent to remote host \"k8s3.local\". Make sure this host can be reached over ssh: ssh: connect to host k8s3.local port 22: Connection timed out\r\n", "unreachable": true}, {"ansible_facts": {"ansible_default_ipv4": {"address": "10.88.111.30", "alias": "eth0", "broadcast": "10.88.111.255", "gateway": "10.88.111.254", "interface": "eth0", "macaddress": "bc:24:11:be:42:a6", "mtu": 1500, "netmask": "255.255.252.0", "network": "10.88.108.0", "prefix": "22", "type": "ether"}, "discovered_interpreter_python": "/usr/bin/python3"}, "ansible_loop_var": "item", "changed": false, "failed": false, "invocation": {"module_args": {"fact_path": "/etc/ansible/facts.d", "filter": ["ansible_default_ipv4"], "gather_subset": ["!all", "network"], "gather_timeout": 10}}, "item": "k8s2.local"}]}
...ignoring

NO MORE HOSTS LEFT **********************************************************************************************************************************************************************************************

PLAY RECAP ******************************************************************************************************************************************************************************************************
k8s1.local                 : ok=1    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=1

After my PR:

details
PLAY [Prepare nodes for upgrade] ********************************************************************************************************************************************************************************

TASK [kubespray-defaults : Determine reachable hosts] ***********************************************************************************************************************************************************
ok: [k8s1.local] => (item=k8s1.local)
[WARNING]: Unhandled error in Python interpreter discovery for host k8s1.local: Failed to connect to the host via ssh: ssh: connect to host k8s3.local port 22: Connection timed out
failed: [k8s1.local -> k8s3.local] (item=k8s3.local) => {"ansible_loop_var": "item", "item": "k8s3.local", "msg": "Data could not be sent to remote host \"k8s3.local\". Make sure this host can be reached over ssh: ssh: connect to host k8s3.local port 22: Connection timed out\r\n", "unreachable": true}
ok: [k8s1.local -> k8s2.local] => (item=k8s2.local)
fatal: [k8s1.local -> {{ item }}]: UNREACHABLE! => {"changed": false, "msg": "All items completed", "results": [{"ansible_loop_var": "item", "changed": false, "failed": false, "invocation": {"module_args": {"data": "pong"}}, "item": "k8s1.local", "ping": "pong"}, {"ansible_loop_var": "item", "item": "k8s3.local", "msg": "Data could not be sent to remote host \"k8s3.local\". Make sure this host can be reached over ssh: ssh: connect to host k8s3.local port 22: Connection timed out\r\n", "unreachable": true}, {"ansible_loop_var": "item", "changed": false, "failed": false, "invocation": {"module_args": {"data": "pong"}}, "item": "k8s2.local", "ping": "pong"}]}
...ignoring

TASK [kubespray-defaults : Gather ansible_default_ipv4 from reachable hosts] ************************************************************************************************************************************
ok: [k8s1.local] => (item=k8s1.local)
ok: [k8s1.local -> k8s2.local] => (item=k8s2.local)

TASK [kubespray-defaults : Create fallback_ips_base] ************************************************************************************************************************************************************
ok: [k8s1.local -> localhost]

TASK [kubespray-defaults : Set fallback_ips] ********************************************************************************************************************************************************************
ok: [k8s1.local]
ok: [k8s2.local]
ok: [k8s3.local]

TASK [Task that runs after] *************************************************************************************************************************************************************************************
ok: [k8s1.local] => {
    "fallback_ips": {
        "k8s1.local": "10.88.111.29",
        "k8s2.local": "10.88.111.30",
        "k8s3.local": "127.0.0.1"
    }
}
ok: [k8s3.local] => {
    "fallback_ips": {
        "k8s1.local": "10.88.111.29",
        "k8s2.local": "10.88.111.30",
        "k8s3.local": "127.0.0.1"
    }
}
ok: [k8s2.local] => {
    "fallback_ips": {
        "k8s1.local": "10.88.111.29",
        "k8s2.local": "10.88.111.30",
        "k8s3.local": "127.0.0.1"
    }
}

PLAY RECAP ******************************************************************************************************************************************************************************************************
k8s1.local                 : ok=5    changed=0    unreachable=0    failed=0    skipped=3    rescued=0    ignored=1
k8s2.local                 : ok=2    changed=0    unreachable=0    failed=0    skipped=2    rescued=0    ignored=0
k8s3.local                 : ok=2    changed=0    unreachable=0    failed=0    skipped=2    rescued=0    ignored=0

Also to note, I originally tried making this change to add ignore_errors: true :

details
#
 roles/kubespray-defaults/tasks/fallback_ips.yml

 ---
 # Set 127.0.0.1 as fallback IP if we do not have host facts for host
 # ansible_default_ipv4 isn't what you think.
 # Thanks https://medium.com/opsops/ansible-default-ipv4-is-not-what-you-think-edb8ab154b10

 - name: Gather ansible_default_ipv4 from all hosts
   setup:
     gather_subset: '!all,network'
     filter: "ansible_default_ipv4"
   delegate_to: "{{ item }}"
   delegate_facts: yes
   when: hostvars[item].ansible_default_ipv4 is not defined
   loop: "{{ (groups['k8s_cluster'] | default([]) + groups['etcd'] | default([]) + groups['calico_rr'] | default([])) | unique }}"
   run_once: yes
   ignore_unreachable: true
+
  ignore_errors: true

   tags: always

 - name: Create fallback_ips_base
   set_fact:
     fallback_ips_base: |
       ---
       {% for item in (groups['k8s_cluster'] | default([]) + groups['etcd'] | default([]) + groups['calico_rr'] | default([])) | unique %}
-
      {% set found = hostvars[item].get('ansible_default_ipv4') %}

+
      {% set found = hostvars[item].get('ansible_default_ipv4', {}) %}

       {{ item }}: "{{ found.get('address', '127.0.0.1') }}"
       {% endfor %}
   delegate_to: localhost
   connection: local
   delegate_facts: yes
   become: no
   run_once: yes

 - name: Set fallback_ips
   set_fact:
     fallback_ips: "{{ hostvars.localhost.fallback_ips_base | from_yaml }}"
PLAY [Prepare nodes for upgrade] ********************************************************************************************************************************************************************************

TASK [kubespray-defaults : Gather ansible_default_ipv4 from all hosts] ******************************************************************************************************************************************
ok: [k8s1.local] => (item=k8s1.local)
[WARNING]: Unhandled error in Python interpreter discovery for host k8s1.local: Failed to connect to the host via ssh: ssh: connect to host k8s3.local port 22: Connection timed out
failed: [k8s1.local -> k8s3.local] (item=k8s3.local) => {"ansible_loop_var": "item", "item": "k8s3.local", "msg": "Data could not be sent to remote host \"k8s3.local\". Make sure this host can be reached over ssh: ssh: connect to host k8s3.local port 22: Connection timed out\r\n", "unreachable": true}
ok: [k8s1.local -> k8s2.local] => (item=k8s2.local)
fatal: [k8s1.local -> {{ item }}]: UNREACHABLE! => {"changed": false, "msg": "All items completed", "results": [{"ansible_facts": {"ansible_default_ipv4": {"address": "10.88.111.29", "alias": "eth0", "broadcast": "10.88.111.255", "gateway": "10.88.111.254", "interface": "eth0", "macaddress": "bc:24:11:41:88:12", "mtu": 1500, "netmask": "255.255.252.0", "network": "10.88.108.0", "prefix": "22", "type": "ether"}, "discovered_interpreter_python": "/usr/bin/python3"}, "ansible_loop_var": "item", "changed": false, "failed": false, "invocation": {"module_args": {"fact_path": "/etc/ansible/facts.d", "filter": ["ansible_default_ipv4"], "gather_subset": ["!all", "network"], "gather_timeout": 10}}, "item": "k8s1.local"}, {"ansible_loop_var": "item", "item": "k8s3.local", "msg": "Data could not be sent to remote host \"k8s3.local\". Make sure this host can be reached over ssh: ssh: connect to host k8s3.local port 22: Connection timed out\r\n", "unreachable": true}, {"ansible_facts": {"ansible_default_ipv4": {"address": "10.88.111.30", "alias": "eth0", "broadcast": "10.88.111.255", "gateway": "10.88.111.254", "interface": "eth0", "macaddress": "bc:24:11:be:42:a6", "mtu": 1500, "netmask": "255.255.252.0", "network": "10.88.108.0", "prefix": "22", "type": "ether"}, "discovered_interpreter_python": "/usr/bin/python3"}, "ansible_loop_var": "item", "changed": false, "failed": false, "invocation": {"module_args": {"fact_path": "/etc/ansible/facts.d", "filter": ["ansible_default_ipv4"], "gather_subset": ["!all", "network"], "gather_timeout": 10}}, "item": "k8s2.local"}]}
...ignoring

TASK [kubespray-defaults : Create fallback_ips_base] ************************************************************************************************************************************************************
ok: [k8s1.local -> localhost]

TASK [kubespray-defaults : Set fallback_ips] ********************************************************************************************************************************************************************
ok: [k8s1.local]
ok: [k8s3.local]
ok: [k8s2.local]

TASK [Task that runs after] *************************************************************************************************************************************************************************************
ok: [k8s1.local] => {
    "fallback_ips": {
        "k8s1.local": "127.0.0.1",
        "k8s2.local": "127.0.0.1",
        "k8s3.local": "127.0.0.1"
    }
}
ok: [k8s3.local] => {
    "fallback_ips": {
        "k8s1.local": "127.0.0.1",
        "k8s2.local": "127.0.0.1",
        "k8s3.local": "127.0.0.1"
    }
}
ok: [k8s2.local] => {
    "fallback_ips": {
        "k8s1.local": "127.0.0.1",
        "k8s2.local": "127.0.0.1",
        "k8s3.local": "127.0.0.1"
    }
}

PLAY RECAP ******************************************************************************************************************************************************************************************************
k8s1.local                 : ok=4    changed=0    unreachable=0    failed=0    skipped=3    rescued=0    ignored=1
k8s2.local                 : ok=2    changed=0    unreachable=0    failed=0    skipped=2    rescued=0    ignored=0
k8s3.local                 : ok=2    changed=0    unreachable=0    failed=0    skipped=2    rescued=0    ignored=0

But as you can see, while it got past the early exit, it looks like it doesn't save the discovered facts. And so the output is all wrong. That's why I decided to first filter out the unreachable hosts in a separate task before running setup .

Does this PR introduce a user-facing change? :

None

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 13, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @Rickkwa . Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here .

Instructions for interacting with me using PR comments are available here . If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Mar 13, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Rickkwa
Once this PR has been reviewed and has the lgtm label , please assign oomichi for approval. For more information see the Kubernetes Code Review Process .

The full list of commands accepted by this bot can be found here .

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cyclinder
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 13, 2024
@Rickkwa
Copy link
Contributor Author

Do you need any more info from me to move this forward? I'm hoping this can get into the 2.25 release, whenever that is.

Sign up for free to join this conversation on GitHub . Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fallback_ips.yml exits early when there is an unreachable host in the inventory
3 participants
- "漢字路" 한글한자자동변환 서비스는 교육부 고전문헌국역지원사업의 지원으로 구축되었습니다.
- "漢字路" 한글한자자동변환 서비스는 전통문화연구회 "울산대학교한국어처리연구실 옥철영(IT융합전공)교수팀"에서 개발한 한글한자자동변환기를 바탕하여 지속적으로 공동 연구 개발하고 있는 서비스입니다.
- 현재 고유명사(인명, 지명등)을 비롯한 여러 변환오류가 있으며 이를 해결하고자 많은 연구 개발을 진행하고자 하고 있습니다. 이를 인지하시고 다른 곳에서 인용시 한자 변환 결과를 한번 더 검토하시고 사용해 주시기 바랍니다.
- 변환오류 및 건의,문의사항은 juntong@juntong.or.kr로 메일로 보내주시면 감사하겠습니다. .
Copyright ⓒ 2020 By '전통문화연구회(傳統文化硏究會)' All Rights reserved.
 한국   대만   중국   일본