Advanced Nacos, Part 1: Troubleshooting Nacos Dynamic Configuration That Fails to Take Effect

Problem Description

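We built a new project whose core functionality consists of externally exposed API endpoints and internal scheduled tasks. The API service is deployed on 4 servers, and the scheduled tasks are deployed on 2 other servers. The project uses some configuration values that change over time, and these are managed through the Nacos configuration center.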

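After we modified the configuration in Nacos, the configuration of the API gateway service was updated dynamically, but the dynamic configuration of the scheduled-task service did not take effect. Tracking this down made me question my sanity for a while. My troubleshooting steps are listed below.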

Nacos Configuration

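The Nacos settings are configured in bootstrap.yml. bootstrap.yml is loaded during the application bootstrap phase and is used for configuration that must be read early. It can be thought of as system-level parameters that generally do not change. Once bootstrap.yml has been loaded, its contents are treated as fixed and are not meant to be changed at runtime.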

The configuration is as follows:

spring:
  application:
    name: order-web

## The sections below are split by environment; each environment loads its own configuration files
---
# Test environment
spring:
  profiles: beta
  #nacos
  cloud:
    nacos:
      discovery:
        server-addr: 172.0.0.1:8848
        namespace: 21406c22-abef-4472-953e-tyea2aeb167a
        username: nacos
        password: nacos
      config:
        server-addr: 172.0.0.1:8848
        username: nacos
        password: nacos
        namespace: 21406c22-abef-4472-953e-tyea2aeb167a
        group: DEFAULT_GROUP
        shared-configs:
          - data-id: common-kafka.yaml
            group: DEFAULT_GROUP
            refresh: true

          - data-id: common-xxl-job.yaml
            group: DEFAULT_GROUP
            refresh: true

          - data-id: common-redis-order.yaml
            group: DEFAULT_GROUP
            refresh: true

          - data-id: common-mysql-order.yaml
            group: DEFAULT_GROUP
            refresh: true
       
        extension-configs:
          - data-id: order-config.yaml
            group: DEFAULT_GROUP
            refresh: true
---
# Local environment
spring:
  profiles: local
  #nacos
  cloud:
    nacos:
      discovery:
        server-addr: 172.0.0.1:8848
        namespace: 21406c22-abef-4472-953e-tyea2aeb167b
        username: nacos
        password: nacos
      config:
        server-addr: 172.0.0.1:8848
        username: nacos
        password: nacos
        namespace: 21406c22-abef-4472-953e-tyea2aeb167b
        group: DEFAULT_GROUP
        shared-configs:
          - data-id: common-kafka.yaml
            group: DEFAULT_GROUP
            refresh: true

          - data-id: common-xxl-job.yaml
            group: DEFAULT_GROUP
            refresh: true

          - data-id: common-redis-order.yaml
            group: DEFAULT_GROUP
            refresh: true

          - data-id: common-mysql-order.yaml
            group: DEFAULT_GROUP
            refresh: true
       
        extension-configs:
          - data-id: order-config.yaml
            group: DEFAULT_GROUP
            refresh: true
---
# Production environment
spring:
  profiles: prod
  #nacos
  cloud:
    nacos:
      discovery:
        server-addr: 172.0.0.1:8848
        namespace: 21406c22-abef-4472-953e-tyea2aeb167c
        username: nacos
        password: nacos
      config:
        server-addr: 172.0.0.1:8848
        username: nacos
        password: nacos
        namespace: 21406c22-abef-4472-953e-tyea2aeb167c
        group: DEFAULT_GROUP
        shared-configs:
          - data-id: common-kafka.yaml
            group: DEFAULT_GROUP
            refresh: true

          - data-id: common-xxl-job.yaml
            group: DEFAULT_GROUP
            refresh: true

          - data-id: common-redis-order.yaml
            group: DEFAULT_GROUP
            refresh: true

          - data-id: common-mysql-order.yaml
            group: DEFAULT_GROUP
            refresh: true
       
        extension-configs:
          - data-id: order-config.yaml
            group: DEFAULT_GROUP
            refresh: true

Troubleshooting Steps

1. spring.application.name is set in the bootstrap configuration file

spring.application.name is defined in bootstrap.yml, so this condition is met.

2. refresh is set to true

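The refreshEnabled property of NacosConfigProperties defaults to true, so it does not need to be set explicitly. Each entry under shared-configs and extension-configs must have refresh set to true. Our configuration is correct here as well.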

refresh-enabled: true

3. Investigate by enabling log output

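Enable DEBUG logging for the Nacos packages with the configuration below, then search the log for "Refresh Nacos config group". The search returns nothing.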

logging:
  level:
    com:
      alibaba:
        nacos: DEBUG

The NacosContextRefresher class is defined as follows:

public class NacosContextRefresher implements ApplicationListener<ApplicationReadyEvent>, ApplicationContextAware {
  public void onApplicationEvent(ApplicationReadyEvent event) {
        if (this.ready.compareAndSet(false, true)) {
            this.registerNacosListenersForApplications();
        }
    }
  
     private void registerNacosListenersForApplications() {
        if (this.isRefreshEnabled()) {
            Iterator var1 = NacosPropertySourceRepository.getAll().iterator();

            while(var1.hasNext()) {
                NacosPropertySource propertySource = (NacosPropertySource)var1.next();
                if (propertySource.isRefreshable()) {
                    String dataId = propertySource.getDataId();
                    this.registerNacosListener(propertySource.getGroup(), dataId);
                }
            }
        }
    }

    private void registerNacosListener(final String groupKey, final String dataKey) {
        String key = NacosPropertySourceRepository.getMapKey(dataKey, groupKey);
        Listener listener = (Listener)this.listenerMap.computeIfAbsent(key, (lst) -> {
            return new AbstractSharedListener() {
                public void innerReceive(String dataId, String group, String configInfo) {
                    NacosContextRefresher.refreshCountIncrement();
                    NacosContextRefresher.this.nacosRefreshHistory.addRefreshRecord(dataId, group, configInfo);
                    NacosContextRefresher.this.applicationContext.publishEvent(new RefreshEvent(this, (Object)null, "Refresh Nacos config"));
                    if (NacosContextRefresher.log.isDebugEnabled()) {
                        NacosContextRefresher.log.debug(String.format("Refresh Nacos config group=%s,dataId=%s,configInfo=%s", group, dataId, configInfo));
                    }
                }
            };
        });

        try {
            this.configService.addListener(dataKey, groupKey, listener);
        } catch (NacosException var6) {
            log.warn(String.format("register fail for nacos listener ,dataId=[%s],group=[%s]", dataKey, groupKey), var6);
        }
    }
}
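
The important detail in the class above is that the Nacos listeners are registered only when the ready event arrives, and only once. The following stand-alone model illustrates that gating in plain Java. It is a sketch, not Nacos code: RefresherModel, FakeConfigServer, onReadyEvent and publishChange are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Simplified model of "register config listeners only after the ready event".
class RefresherModel {

    /** Hypothetical stand-in for the config service: it only stores and notifies listeners. */
    static class FakeConfigServer {
        private final List<Runnable> listeners = new ArrayList<>();

        void addListener(Runnable listener) {
            listeners.add(listener);
        }

        void publishChange() {
            listeners.forEach(Runnable::run);
        }
    }

    private final AtomicBoolean ready = new AtomicBoolean(false);
    private final FakeConfigServer server;
    private int refreshCount = 0;

    RefresherModel(FakeConfigServer server) {
        this.server = server;
    }

    /** Mirrors onApplicationEvent: registers once, and only when the ready event is delivered. */
    void onReadyEvent() {
        if (ready.compareAndSet(false, true)) {
            server.addListener(() -> refreshCount++);
        }
    }

    int refreshCount() {
        return refreshCount;
    }

    public static void main(String[] args) {
        FakeConfigServer server = new FakeConfigServer();
        RefresherModel refresher = new RefresherModel(server);

        server.publishChange(); // ready event not delivered yet, so nobody is listening
        System.out.println("refreshes before ready: " + refresher.refreshCount());

        refresher.onReadyEvent();
        refresher.onReadyEvent(); // a repeated event is ignored by compareAndSet
        server.publishChange();
        System.out.println("refreshes after ready: " + refresher.refreshCount());
    }
}
```

In this model a configuration change published before the ready event is simply lost, which matches the symptom: no refresh and no debug log line.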

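The debug message is logged inside the listener created by registerNacosListener, and that listener is registered only when onApplicationEvent receives an ApplicationReadyEvent. So what prevented the ApplicationListener from being triggered and registering the Nacos listeners?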

4. Walk through the Spring Boot startup flow

The core class of Spring Boot is SpringApplication:

public class SpringApplication {
  public ConfigurableApplicationContext run(String... args) {
        StopWatch stopWatch = new StopWatch();
        stopWatch.start();
        ConfigurableApplicationContext context = null;
        Collection<SpringBootExceptionReporter> exceptionReporters = new ArrayList<>();
        this.configureHeadlessProperty();
        SpringApplicationRunListeners listeners = this.getRunListeners(args);
        listeners.starting();
        try {
            ApplicationArguments applicationArguments = new DefaultApplicationArguments(args);
            ConfigurableEnvironment environment = this.prepareEnvironment(listeners, applicationArguments);
            this.configureIgnoreBeanInfo(environment);
            Banner printedBanner = this.printBanner(environment);
            context = this.createApplicationContext();
            exceptionReporters = this.getSpringFactoriesInstances(SpringBootExceptionReporter.class, new Class[]{ConfigurableApplicationContext.class}, context);
            this.prepareContext(context, environment, listeners, applicationArguments, printedBanner);
            this.refreshContext(context);
            this.afterRefresh(context, applicationArguments);
            stopWatch.stop();
            if (this.logStartupInfo) {
                (new StartupInfoLogger(this.mainApplicationClass)).logStarted(this.getApplicationLog(), stopWatch);
            }

            listeners.started(context);
            this.callRunners(context, applicationArguments);
        } catch (Throwable var10) {
            this.handleRunFailure(context, var10, exceptionReporters, listeners);
            throw new IllegalStateException(var10);
        }

        try {
            listeners.running(context);
            return context;
        } catch (Throwable var9) {
            this.handleRunFailure(context, var9, exceptionReporters, (SpringApplicationRunListeners)null);
            throw new IllegalStateException(var9);
        }
    }
}

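SpringApplication.run() contains the line callRunners(context, applicationArguments). This method uses the main thread to execute the classes that implement ApplicationRunner and CommandLineRunner. If any of those classes blocks, callRunners never returns, so Spring never reaches listeners.running(context), the step that publishes ApplicationReadyEvent.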
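
The ordering can be made concrete with a small stand-alone simulation in plain Java. This is only a sketch under assumptions: StartupSimulation, Runner and readyReached are hypothetical names rather than Spring APIs, a separate thread plays the role of the main thread, and a latch stands in for the ready event.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Simplified model of the tail of SpringApplication.run(): runners execute
// one after another on the same thread, and the "ready" step comes afterwards.
class StartupSimulation {

    /** Hypothetical stand-in for ApplicationRunner / CommandLineRunner. */
    interface Runner {
        void run() throws InterruptedException;
    }

    /** Returns true if the ready step was reached within timeoutMillis. */
    static boolean readyReached(List<Runner> runners, long timeoutMillis) {
        CountDownLatch ready = new CountDownLatch(1);
        Thread main = new Thread(() -> {
            try {
                for (Runner runner : runners) {
                    runner.run();          // corresponds to callRunners(...)
                }
                ready.countDown();         // corresponds to listeners.running(context)
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "simulated-main");
        main.setDaemon(true);
        main.start();
        try {
            return ready.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            main.interrupt(); // release a blocked runner so the simulation can end
        }
    }

    public static void main(String[] args) {
        Runner quick = () -> { };
        Runner blocking = () -> {
            while (true) {
                Thread.sleep(50); // never returns, like while (true) inside a runner
            }
        };
        System.out.println("quick runner, ready reached: " + readyReached(Arrays.asList(quick), 500));
        System.out.println("blocking runner, ready reached: " + readyReached(Arrays.asList(blocking), 500));
    }
}
```

With the quick runner the ready step is reached; with the blocking runner the wait times out, which is the situation the scheduled-task service was in.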

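With this analysis the cause is clear: some classes in the project implement ApplicationRunner and contain a while(true) loop, which blocks the main thread at this point. Checking our code confirmed the prediction.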

@Slf4j
@Component
public class PullIncomeSubscriber implements ApplicationRunner {

    @Override
    public void run(ApplicationArguments args) throws Exception {
        doBusiness();
    }

    private void doBusiness() {
        while (true) {
            try {
                this.execute();
            } catch (Exception ex) {
                log.error("PullIncomeSubscriber.execute", ex);
                AlterFunction.sendMsg(AlterCodeEnum.AD_TRACK, "Income pull task failed: " + ex.getMessage());
            }
        }
    }

    public void execute() throws Exception {
        PullIncomTask pull = pullIncomeTaskCache.pull();

        if (Objects.isNull(pull)) {
            Thread.sleep(30000);
            return;
        }

        log.info("PullIncomeSubscriber.execute#adPlatform={}", pull.getAdPlatform());
        // business logic
    }
}

After we changed the runner to execute its loop on a separate thread, the problem was completely resolved.

@Slf4j
@Component
public class PullIncomeSubscriber implements ApplicationRunner {
    private final ExecutorService pool = Executors.newSingleThreadExecutor(); 

    @Override
    public void run(ApplicationArguments args) throws Exception {
        pool.execute(this::doBusiness);
    }
}
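
One possible refinement is to let the background loop stop cleanly on shutdown. The original doBusiness() catches Exception inside while (true), so an InterruptedException thrown by Thread.sleep would be logged and the loop would keep running even after the executor is shut down. The following plain-Java sketch handles the interrupt explicitly. BackgroundPoller, the thread name and the 10 ms sleep are assumptions made for illustration, and the counter stands in for the real execute() call.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a polling loop that runs off the startup thread and can be stopped.
class BackgroundPoller {
    private final ExecutorService pool = Executors.newSingleThreadExecutor(task -> {
        Thread thread = new Thread(task, "pull-income-subscriber"); // hypothetical name
        thread.setDaemon(true); // do not keep the JVM alive just for the poller
        return thread;
    });
    private final AtomicInteger iterations = new AtomicInteger();

    /** Returns immediately; the loop runs on the pool thread. */
    void start() {
        pool.execute(this::loop);
    }

    private void loop() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                iterations.incrementAndGet(); // placeholder for the real execute()
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag so the loop exits
            }
        }
    }

    /** Interrupts the loop and waits briefly; returns true if the worker terminated. */
    boolean stop() {
        pool.shutdownNow();
        try {
            return pool.awaitTermination(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    int iterations() {
        return iterations.get();
    }

    public static void main(String[] args) throws InterruptedException {
        BackgroundPoller poller = new BackgroundPoller();
        poller.start(); // the caller is not blocked here
        Thread.sleep(100);
        System.out.println("iterations so far: " + poller.iterations());
        System.out.println("stopped cleanly: " + poller.stop());
    }
}
```

In a Spring application, stop() could be called from a method annotated with @PreDestroy. That wiring is an assumption and is not part of the original fix.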