MATLAB, 강화학습을 이용한 급수 시스템 스케쥴링 실습

이번 포스팅은 간단한 조합 문제를 강화학습으로 풀어봅시다. 해당 내용은 MATLAB 도움말 을 번역했을 뿐이니, 정확한 내용은 MATLAB 강화학습 툴박스 를 참고하여 주세요.

1. Water Distribution System Scheduling Using Reinforcement Learning

이번 예제에서는 어떻게 강화학습(Reinforcement Learning)을 통해 급수시스템에 대해 최적화된 펌프 스케쥴링 정책(policy)을 학습하는지 보여줍니다.

1.0. Local Functions

MATLAB에서는 이게 제일 위에 올라가 있어야 하나씩 따라할 수 있는데 원본 사이트에서는 제일 밑에 내려가 있습니다. 이걸 따라하는 유저 입장에서는

한심좌

앞으로 진행할 스크립트들이 알아서 돌아가게 하게끔 지역 함수(Local Functions)부터 미리 정의합시다.

Water Demand Function

function [WaterDemand,T_max] = generateWaterDemand(num_days)

    t = 0:(num_days*24)-1; % hr
    T_max = t(end);

    Demand_mean = [28, 28, 28, 45, 55, 110, 280, 450, 310, 170, 160, 145, 130, ...
        150, 165, 155, 170, 265, 360, 240, 120, 83, 45, 28]'; % m^3/hr

    Demand = repmat(Demand_mean,1,num_days);
    Demand = Demand(:);

    % Add noise to demand
    a = -25; % m^3/hr
    b = 25; % m^3/hr
    Demand_noise = a + (b-a).*rand(numel(Demand),1);

    WaterDemand = timeseries(Demand + Demand_noise,t);
    WaterDemand.Name = "Water Demand";
end

Scheduling problem 에서 입력에 해당하는 Demand 와 관련된 함수입니다. repmat 은 배열 혹은 스칼라(첫번째 인자)를 입력으로 받아 복사해주는 함수입니다. 아마 Repeat matrix 라 생각합니다. Demand_mean 벡터를 지정된 일 수 만큼 반복하여 Demand 를 만들고 적당한 잡은(noise) 를 만들어줍니다.

Reset Function

function in = localResetFcn(in)

    % Use a persistent random seed value to evaluate the agent and the baseline
    % controller under the same conditions.
    persistent randomSeed
    if isempty(randomSeed)
        randomSeed = 0;
    end
    if strcmp(in,"Reset seed")
        randomSeed = 0;
        return
    end    
    randomSeed = randomSeed + 1;
    rng(randomSeed)
    
    % Randomize water demand.
    num_days = 4;
    H_max = 7;
    [WaterDemand,~] = generateWaterDemand(num_days);
    assignin('base','WaterDemand',WaterDemand)

    % Randomize initial height.
    h0 = 3*randn;
    while h0 <= 0 || h0 >= H_max
        h0 = 3*randn;
    end
    blk = 'watertankscheduling/Water Tank System/Initial Water Height';

    in = setBlockParameter(in,blk,'Value',num2str(h0));

end

MATLAB 은 포인터 전달(Pass by reference) 기능이 아쉽습니다. 반환으로만 전달하죠. localResetFcn 을 통해 Demand 를 새로 만들고, setBlockParameter 에 넣습니다. Block parameter 는 아래를 나타내며 이 또한 MATLAB 도움말을 참고했습니다.

Block parameter 는 생성된 코드에서 변수로 나타냅니다. 아래와 같은 특성을 가진 변수들의 모양을 제어합니다.

생성된 코드의 변수를 인라인 확장(Inlining)합니다.
Function block, 전역 변수, 아직 정의되지 않은 변수에 마킹합니다.

여기서 얘기하는 것은 만들어지는 환경 설정에 필요한 변수들 정도로 이해할 수 있습니다. 이제 본격적으로 시작할게요.

1.1. Water Distribution system

먼저 현재 급수 시스템은 아래 그림을 따르고 있습니다.

system diagram

위 Diagram 에 대한 용어(Terminology) 설명입니다.

\[\begin{array}{l} \cdot \ Q_{\text{Supply}}\text{ 는 저장소로부터 물탱크로 공급하는 물의 양입니다.}\\ \cdot \ Q_{\text{Demand}}\text{ 는 요구하는 사용량에 맞춰 탱크 밖으로 나오는 물의 양입니다.} \\ \end{array}\]

강화학습 Agent 의 목적은 효율적인 스케쥴링입니다. 에너지 사용량을 최소화하며, 수요를 충족시키는 것이죠. 탱크 시스템의 물리법칙은 아래 조건에 의해 운영됩니다.

\[\begin{array}{l} A\frac{dh}{dt}=Q_{\text{Supply}}(t)-Q_{\text{Demand}}(t) \\ \text{여기서, }A=40m^2, h_\max=7m. \text{으로 24시간 수요에 대한 시간 함수는 아래와 같습니다.} \\ Q_{\text{Demand}}(t) = \mu(t) +\eta(t)\\ \mu(t)\text{는 예상되는 요구사항이고 }\eta(t) \text{ 불확실한 요구사항을 표현합니다.}\\ \text{그리고 샘플링은 균등 분포를 기반으로 진행합니다.}\\ \text{그림에서 펌프의 수는 최대 3개입니다. 0, 1, 2, 3을 선택할 수 있죠. } a \in \{0, 1, 2, 3\}\\ \text{공급은 돌아가는 펌프의 수에 따라 아래의 식에 매핑됩니다.}\\ Q_{\text{Supply}}(t)=Q(a)= \begin{cases} 0, & \mbox{a=0} \\ 164, & \mbox{a=1} \\ 279, & \mbox{a=2} \\ 344, & \mbox{a=3} \\ \end{cases} \frac{\text{cm}}{\text{h}}\\ \text{문제를 단순하게 하기 위해서, 전력 소비량에 초점을 맞췄습니다.}\\ \text{이 환경에서 보상함수는 아래처럼 정의됩니다. 탱크에서 물이 비거나 범람하는 경우를 피하기 위해서,}\\ \text{만약 물 높이가 탱크의 용량 하한, 상한에 가까워지면 추가적인 비용조건이 붙습니다}\\ r(h,\ a)=-10(h\ge(h_\max-0.1))-10(h\le0.1)-a\\ \text{물이 넘칠라 그러면 -10, 물이 0.1보다 적어도 -10}\\ \text{둘 다 잘 지키면서 아니면 동작하는 펌프가 적을수록 좋습니다.} \end{array}\]

1.2. Generate Demand Profile

Demand Profile 을 만들어 봅시다. 고려한 날짜 수를 기준으로 수요 프로파일을 작성할 때는 generateWaterDemand 지역 함수를 이용합니다. (위에 있었죠.) 예제에서는 4일로 설정했네요.

num_days = 4; % Number of days
[WaterDemand,T_max] = generateWaterDemand(num_days);

Demand Profile 은 시각화 할 수 있습니다.

plot(WaterDemand)

plot

1.3. Open and Configure Model

급수시스템은 Simulink 에 미리 정의된 모델입니다. 그래서 시스템을 불러올 수 있습니다.

mdl = "watertankscheduling";
open_system(mdl)

system diagram

강화학습 agent 말고도 MATLAB 함수 블록 제어 룰에 의한 간단한 기준 컨트롤러도 있습니다. 이 컨트롤러는 물의 높이에 의존하여 펌프를 작동합니다.

최초 물의 높이를 설정하고

h0 = 3; % m

모델 파라미터를 정합니다.

SampleTime = 0.2;
H_max = 7; % Max tank height (m)
A_tank = 40; % Area of tank (m^2)

1.4. Create Environment Interface for RL Agent

Simulink 모델과 인터페이스하는 환경을 만들기 위해서 먼저 action, observation 에 대해 정의합니다. actInfo 과 obsInfo 에 담겨져있죠. agent action 은 펌프의 수를 선택하고 agent observation 은 연속된 시간 속에서의 물의 높이를 측정합니다.

actInfo = rlFiniteSetSpec([0,1,2,3]);
obsInfo = rlNumericSpec([1,1]);

환경 인터페이스를 만들어봅시다. (actInfo, obsInfo 를 넣어서요.)

env = rlSimulinkEnv(mdl,mdl+"/RL Agent",obsInfo,actInfo);

위에 지역함수로 놔뒀던 리셋함수를 환경 리셋함수로서 넣어줍니다. 위에서 설명했던대로 랜덤한 수위(물의 높이) 초기값, 물에 대한 요구조건이 정해지고 각 에피소드마다 랜덤으로 주어집니다.

env.ResetFcn = @(in)localResetFcn(in);

1.5. Create DQN Agent

이제 DQN 의 시간입니다. 본격적인 학습이 들어가죠. DQN(Deep Q Network) 는 agent approximates 로서 주어진 observations 와 action 으로 Q value function 에 대해 평가합니다. 그래서 평가에 대한 Deep neural network 를 구성하고 만들어진 네트워크에서 평가하는 추가적인 설명은 해당 링크에서 볼 수 있습니다. Create Policy and Value Function Representations.

% Fix the random generator seed for reproducibility.
rng(0);

useLSTM 를 true로 두어 LSTM 의 순환신경망 형태로 레이어를 구성할 수 있고, false 로 두어 일반적인 Fully connected layer 로도 구성할 수 있습니다.

useLSTM = false;
if useLSTM
    layers = [
        sequenceInputLayer(obsInfo.Dimension(1),"Name","state","Normalization","none")
        fullyConnectedLayer(32,"Name","fc_1")
        reluLayer("Name","relu_body1")
        lstmLayer(32,"Name","lstm")
        fullyConnectedLayer(32,"Name","fc_3")
        reluLayer("Name","relu_body3")
        fullyConnectedLayer(numel(actInfo.Elements),"Name","output")];
else
    layers = [
        featureInputLayer(obsInfo.Dimension(1),"Name","state","Normalization","none")
        fullyConnectedLayer(32,"Name","fc_1")
        reluLayer("Name","relu_body1")
        fullyConnectedLayer(32,"Name","fc_2")
        reluLayer("Name","relu_body2")
        fullyConnectedLayer(32,"Name","fc_3")
        reluLayer("Name","relu_body3")
        fullyConnectedLayer(numel(actInfo.Elements),"Name","output")];
end

critic 에 대한 설명입니다. gpu , 엔비디아 계열의 그래픽 카드를 사용하고 병렬처리 툴박스가 있다면 아래 주석을 풀고 사용하면 좀 더 빠릅니다.

criticOpts = rlRepresentationOptions('LearnRate',0.001,'GradientThreshold',1);
% criticOpts = rlRepresentationOptions('LearnRate',0.001,'GradientThreshold',1, 'UseDevice', 'gpu');

이제 위에 만들어진 레이어부터 다 넣어줍니다.

critic = rlQValueRepresentation(layerGraph(layers),obsInfo,actInfo,...
    'Observation',{'state'},criticOpts);

1.6. Create DQN Agent

LSTM 을 사용한다면 시퀀스 길이는 20이나, FC 레이어를 사용한다면 별다른 시퀀스없이 1로 설정합니다.

opt = rlDQNAgentOptions('SampleTime',SampleTime); 
if useLSTM
    opt.SequenceLength = 20;
else
    opt.SequenceLength = 1;
end
opt.DiscountFactor = 0.995;
opt.ExperienceBufferLength = 1e6;
opt.EpsilonGreedyExploration.EpsilonDecay = 1e-5;
opt.EpsilonGreedyExploration.EpsilonMin = .02;

DQN agent 에 critic 과 option 을 넣어줍니다.

agent = rlDQNAgent(critic,opt);    

1.7. Train Agent

Agent 학습에 대해서는 여느 다른 인공지능 학습처럼 Training option 또한 설정합니다.

trainOpts = rlTrainingOptions(...
    'MaxEpisodes',1000, ...
    'MaxStepsPerEpisode',ceil(T_max/SampleTime), ...
    'Verbose',false, ...
    'Plots','training-progress',...
    'StopTrainingCriteria','EpisodeCount',...
    'StopTrainingValue',1000,...
    'ScoreAveragingWindowLength',100);

훈련중에 추가적인 저장을 할 수 있습니다. 필요하면 사용하세요.

% Save agents using SaveAgentCriteria if necessary
trainOpts.SaveAgentCriteria = 'EpisodeReward';
trainOpts.SaveAgentValue = -42;

doTraining 을 true 로 넣으면, 직접 합슬을 하고 false 를 놓으면 이미 학습된 DQN 을 불러옵니다.

doTraining = true;
if doTraining
    % Train the agent.
    trainingStats = train(agent,env,trainOpts);
else
    % Load the pretrained agent for the example.
    load('SimulinkWaterDistributionDQN.mat','agent')
end

학습 화면은 아래와 같이 나옵니다.

progress

1.8. Simulate DQN Agent

학습된 agent 의 성능 검증을 위해서 시뮬레이션을 하겠습니다. 이에대해 더 자세한 설명은 옆 링크를 참조해주세요. rlSimulationOptions and sim.

껏다 켰다 가능한 스위치를 추가하는 작업입니다.

set_param(mdl+"/Manual Switch",'sw','0');

리셋하고 시뮬레이션 하는 것을 30번 반복하도록 설정합니다.

NumSimulations = 30;
simOptions = rlSimulationOptions('MaxSteps',T_max/SampleTime,...
    'NumSimulations', NumSimulations);

위에 언급한 간단한 기준 컨트롤러와의 성능비교를 위해서 아래를 준비합니다.

env.ResetFcn("Reset seed");

시뮬레이션을 시작해봅시다.

experienceDQN = sim(env,agent,simOptions);

1.9. Simulate Baseline Controller

DQN agent 와 기준컨트롤러와의 비교를 위해서 기준 컨트롤러도 시뮬레이션이 필요합니다. 역시나 위와 같은 절차대로 소스를 작성하겠습니다.

set_param(mdl+"/Manual Switch",'sw','1');
env.ResetFcn("Reset seed");
experienceBaseline = sim(env,agent,simOptions);

1.10. Compare DQN Agent with Baseline Controller

agent 와 기준 컨트롤러의 누적된 보상 결과 벡터를 초기화합니다.

resultVectorDQN = zeros(NumSimulations, 1);
resultVectorBaseline = zeros(NumSimulations,1);

그리고 보상을 누적합니다.

for ct = 1:NumSimulations
    resultVectorDQN(ct) = sum(experienceDQN(ct).Reward);
    resultVectorBaseline(ct) = sum(experienceBaseline(ct).Reward);
end

시각화 하구요.

plot([resultVectorDQN resultVectorBaseline],'o')
set(gca,'xtick',1:NumSimulations)
xlabel("Simulation number")
ylabel('Cumulative Reward')
legend('DQN','Baseline','Location','NorthEastOutside')