This project is about training an RL agent to control a double pendulum (Pendubot) and solve the swing-up and balancing problem under parameter uncertainty.
Here are some videos that show how our agent was able to swing up and balance the Pendubot under parameter uncertainty using the Greedy-Divide and Conquer algorithm:
In all the videos the models were trained only on these parameters:
We then test several cases.
In this case we test the original configuration:
1.1.160.160.mp4
In this case
1.1.2.1.1.160.2.160.mp4
In this case the second mass
1.1.2.2.1.160.2.160.mp4
In this case the first mass is set to 2 kg:
m1.2kg.mp4
In this case the first mass is set to 3 kg:
m1.3kg.mp4
In this case the second mass is set to 2.5 kg:
m2.2.5kg.mp4
In this case all the parameters are randomised
Rand1.mp4
In this case all the parameters are randomised
Rand2.mp4
- Brotli>=1.0.9
- ConfigParser>=5.3.0
- cryptography>=38.0.3
- Cython>=0.29.32
- dl>=0.1.0
- docutils>=0.19
- gym>=0.21.0
- HTMLParser>=0.0.2
- importlib_metadata>=4.13.0
- ipaddr>=2.2.0
- keyring>=23.11.0
- lockfile>=0.12.2
- lxml>=4.9.1
- matplotlib>=3.6.1
- mypy_extensions>=0.4.3
- numpy>=1.23.4
- opencv_python>=4.6.0.66
- ordereddict>=1.1
- protobuf>=4.21.9
- pyOpenSSL>=22.1.0
- scipy>=1.7.1
- stable_baselines3>=1.6.2
- typing_extensions>=4.4.0
- wincertstore>=0.2.1
- xmlrpclib>=1.0.1
- zipp>=3.10.0
All the libraries can be pip-installed using `python3 -m pip install -r requirements.txt`.
- Clone this repo (for help see this tutorial).
- Navigate to the repository folder.
- Install the dependencies specified in requirements.txt using `python3 -m pip install -r requirements.txt`.
- Run `project.py`.
- Run `Uncertainity.py` if you want to test the model under uncertainty in the mass of the pendulum.
The project starts with a single pendulum. It is better to run it on a local machine, because otherwise cv2.imshow() won't work and will raise an error. You can set the parameters of your own system; it is very clear how to do that:
env = Pendulum(m=m, L=L, I=I, b=b, dt=dt, mode='balance')
m # mass of the pendulum bob
L # length of the pendulum
I # inertia of the actuator
b # friction in the actuator
g # gravitational acceleration
dt # step size
theta # initial angle
dtheta # initial angular speed
mode # working mode ['balance', 'swing_up']
max_itr # maximum iterations per episode (balance = 200, swing_up = 500)
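For example, a swing-up environment could be created like this (the numeric values below are illustrative placeholders only, not the parameters used in our experiments):

```python
# Illustrative values only; replace them with your own system's parameters.
env = Pendulum(m=1.0,      # mass of the pendulum bob [kg]
               L=1.0,      # length of the pendulum [m]
               I=0.01,     # inertia of the actuator
               b=0.1,      # friction in the actuator
               dt=0.01,    # step size [s]
               mode='swing_up')
```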
If the mode is set to balance, the pendulum environment behaves as follows:
- It starts near the upright balance angle.
- The agent gets +1 reward for each step it keeps the pendulum angle within [-12, 12].
- The episode terminates if the angle leaves that range.
- The maximum episode length is 200 steps by default, so the maximum return is 200.
If the mode is set to swing_up, the pendulum environment behaves as follows:
- It starts near the downward (hanging) balance angle.
- reward = $-(2\theta^2 + 0.1\dot{\theta}^2 + 0.01\tau^2)$, where $\tau$ is the applied torque (a small sketch of this computation is given after this list).
- The episode terminates if the agent keeps $\theta$ within [-12, 12] for longer than half of the maximum episode length.
- The maximum episode length is 500 steps by default.
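As referenced above, here is a minimal sketch of how that reward could be computed (assuming $\theta$ is measured from the upright position and wrapped to $[-\pi, \pi]$; the environment's actual implementation may differ):

```python
import numpy as np

def swing_up_reward(theta, dtheta, torque):
    # Sketch of reward = -(2*theta^2 + 0.1*dtheta^2 + 0.01*torque^2);
    # theta is assumed to be the angle from the upright position.
    theta = np.arctan2(np.sin(theta), np.cos(theta))  # wrap to [-pi, pi]
    return -(2.0 * theta**2 + 0.1 * dtheta**2 + 0.01 * torque**2)
```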
To test the system you can run it by repeatedly sampling a random action from the action space and applying it to the environment, as in the following code:
import cv2

# Taking random actions and showing the real-time simulation
obs = env.reset()
while True:
    # Take a random action
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    # Render the simulation
    env.render(mode="human")
    if done:
        break
cv2.waitKey(2000)
env.close()
After running the previous code you will get the following result if the mode is set to balance:
And the following result if the mode is set to swing_up:
Now we have to train the agent. Depending on the mode, you can set the maximum number of training timesteps; the swing_up mode is more general but also needs much more training time than the balance mode:
- balance mode:
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

vec_env = DummyVecEnv([lambda: env])
model = PPO('MlpPolicy', vec_env, verbose=1)
model.learn(total_timesteps=20000)
- swing_up mode:
vec_env = DummyVecEnv([lambda: env])
model = PPO('MlpPolicy', vec_env, verbose=1)
model.learn(total_timesteps=200000)
To test the agent we first activate the continuous running mode:
env.continues_run_mode = True
In this mode the system interacts with the user, who can use the keyboard to apply an external disturbance to the system. The arrow keys increase or decrease the magnitude of the external torque and choose its direction, and pressing any other key ends the test.
from stable_baselines3.common.evaluation import evaluate_policy

# Evaluating the results of training
env.continues_run_mode = True
print(evaluate_policy(model, env, n_eval_episodes=1, render=True))
env.close()
i, up arrow : increase the external torque
d, down arrow : decrease the external torque
l, left arrow : apply the external torque to the left
r, right arrow : apply the external torque to the right
q, any other key : finish the testing
You can see all of the above in the following window:
- balance mode: We change the mass randomly (by 20%) and then evaluate the model; the success rate is 100%. That is logical, because the system is fully actuated, and the only situation in which it could fail is if the motor torque is not able to hold the mass.
- swing_up mode: We change the mass randomly (by 20%) and then evaluate the model; the success rate is again 100%, for the same reason. Here, however, it is very clear that the return is strongly tied to the value of the mass, because the dynamics of the system change, which means the response of the system to any action will be different. The result is good enough, so we made no further improvements. (A sketch of this robustness check is given after this list.)
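A minimal sketch of such a robustness check might look like the following. The attribute `env.m` and the numeric values `nominal_mass` and `success_threshold` are assumptions for illustration, not names from the repo:

```python
import numpy as np
from stable_baselines3.common.evaluation import evaluate_policy

# Randomise the pendulum mass by +/-20% before each evaluation episode
# and count how often the agent still succeeds.
nominal_mass = 1.0          # placeholder nominal mass [kg]
success_threshold = 195.0   # placeholder return that counts as "success"
n_trials, successes = 20, 0
for _ in range(n_trials):
    env.m = nominal_mass * np.random.uniform(0.8, 1.2)
    mean_return, _ = evaluate_policy(model, env, n_eval_episodes=1)
    if mean_return >= success_threshold:
        successes += 1
print(f"success rate: {successes / n_trials:.0%}")
```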
A double pendulum is a pendulum with another pendulum attached to its end. It is a simple physical system that exhibits rich dynamic behavior with a strong sensitivity to initial conditions.
The motion of a double pendulum is governed by a set of coupled ordinary differential equations and is chaotic.
$$\ddot{\theta}_1 = \frac{-g(2m_1+m_2)\sin\theta_1 - m_2 g\sin(\theta_1-2\theta_2) - 2\sin(\theta_1-\theta_2)\, m_2\left(\dot{\theta}_2^2 L_2 + \dot{\theta}_1^2 L_1\cos(\theta_1-\theta_2)\right)}{L_1\left(2m_1+m_2-m_2\cos(2\theta_1-2\theta_2)\right)}$$

$$\ddot{\theta}_2 = \frac{2\sin(\theta_1-\theta_2)\left(\dot{\theta}_1^2 L_1(m_1+m_2) + g(m_1+m_2)\cos\theta_1 + \dot{\theta}_2^2 L_2 m_2\cos(\theta_1-\theta_2)\right)}{L_2\left(2m_1+m_2-m_2\cos(2\theta_1-2\theta_2)\right)}$$
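As an illustration, these frictionless equations translate almost directly into an ODE right-hand side that odeint can integrate. This is only a sketch; the actual sys_ode used in the repo additionally handles the control torque (and, later, friction):

```python
import numpy as np

def dp_ode_frictionless(x, t, m1, m2, L1, L2, g=9.81):
    # State x = [theta1, theta2, dtheta1, dtheta2]; returns its time derivative.
    th1, th2, dth1, dth2 = x
    delta = th1 - th2
    den = 2*m1 + m2 - m2*np.cos(2*th1 - 2*th2)
    ddth1 = (-g*(2*m1 + m2)*np.sin(th1) - m2*g*np.sin(th1 - 2*th2)
             - 2*np.sin(delta)*m2*(dth2**2*L2 + dth1**2*L1*np.cos(delta))) / (L1*den)
    ddth2 = (2*np.sin(delta)*(dth1**2*L1*(m1 + m2) + g*(m1 + m2)*np.cos(th1)
             + dth2**2*L2*m2*np.cos(delta))) / (L2*den)
    return [dth1, dth2, ddth1, ddth2]
```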
and after solving the differential equations using the scipy library:
from scipy.integrate import odeint
sol = odeint(self.sys_ode, x0, [0, self.dt], args=(action, ))
self.theta1, self.theta2, self.dtheta1, self.dtheta2 = sol[-1, 0], sol[-1, 1], sol[-1, 2], sol[-1, 3]
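The positions of the two masses needed for plotting can then be recovered from the joint angles, for instance like this (a sketch; it assumes the angles are measured from the downward vertical, which may differ from the repo's convention):

```python
import numpy as np

def mass_positions(theta1, theta2, L1, L2):
    # Cartesian positions of the first and second mass,
    # with angles measured from the downward vertical.
    x1 = L1 * np.sin(theta1)
    y1 = -L1 * np.cos(theta1)
    x2 = x1 + L2 * np.sin(theta2)
    y2 = y1 - L2 * np.cos(theta2)
    return (x1, y1), (x2, y2)
```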
and then, after calculating the positions of the masses as sketched above, simply plotting the results we get:
But as you can see, we still have the problem of friction; without it the model is not realistic enough. To model friction on top of the previous equations in Python, we use the dynamics of manipulators (we consider the double pendulum as a 2-DOF manipulator).
The equations of motion for most mechanical systems may be written in the following form:

$$\mathbf{D}(\mathbf{q})\ddot{\mathbf{q}} + \mathbf{C}(\mathbf{q},\dot{\mathbf{q}})\dot{\mathbf{q}} + \mathbf{g}(\mathbf{q}) = \mathbf{Q} + \mathbf{Q}_d$$

where:
- $\mathbf{Q} \in \mathbb{R}^n$ - generalized forces corresponding to the generalized coordinates
- $\mathbf{Q}_d \in \mathbb{R}^n$ - generalized dissipative forces (for instance friction)
- $\mathbf{q} \in \mathbb{R}^{n}$ - vector of generalized coordinates
- $\mathbf{D} \in \mathbb{R}^{n \times n}$ - positive definite symmetric inertia matrix
- $\mathbf{C} \in \mathbb{R}^{n \times n}$ - describes the 'coefficients' of centrifugal and Coriolis forces
- $\mathbf{g} \in \mathbb{R}^{n}$ - describes the effect of gravity and other position-dependent forces
- $\mathbf{h} \in \mathbb{R}^n$ - combined effect of $\mathbf{g}$ and $\mathbf{C}$, i.e. $\mathbf{h} = \mathbf{C}\dot{\mathbf{q}} + \mathbf{g}$
In order to find the EoM we will use the Lagrange-Euler equations:

$$\frac{d}{dt}\frac{\partial \mathcal{L}}{\partial \dot{\mathbf{q}}} - \frac{\partial \mathcal{L}}{\partial \mathbf{q}} + \frac{\partial \mathcal{R}}{\partial \dot{\mathbf{q}}} = \mathbf{Q}$$

where:
- $\mathcal{L}(\mathbf{q},\dot{\mathbf{q}}) \triangleq E_K - E_\Pi \in \mathbb{R}$ - Lagrangian of the system
- $\mathcal{R} \in \mathbb{R}$ - Rayleigh function (describes energy dissipation)
and here we add two dissipative elements in this system, namely "dampers" with coefficients
and after applying the Lagrange formalism to obtain the equations of motion:
Now we can find the
and so we get:
For balancing, the task is simple enough to be solved with a simple reward function: since the double pendulum starts from around the vertical position, we just give a negative reward for the speeds and for theta1 and theta2 being far away from the vertical position, as follows:
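A minimal sketch of a balancing reward of this kind, with illustrative weights and assuming theta1 and theta2 are measured from the upright position, could look like this (not the exact function used in the trained models):

```python
def balance_reward(theta1, theta2, dtheta1, dtheta2):
    # Penalise deviation from the upright position and large angular speeds.
    # The weights here are illustrative, not the ones used in the trained models.
    return -(theta1**2 + theta2**2) - 0.1 * (dtheta1**2 + dtheta2**2)
```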
We tried over 50 different reward functions (the models and some of the reward functions are uploaded) to make it swing up AND balance, and we could not succeed until we used an if statement in the reward function. For continuous-input RL problems, continuous reward functions might not work! For example, one of the reward functions (explained in detail below) was trained for 10,000,000 timesteps over 24 hours and could not swing up and balance; it could only swing up. Since it seemed that the task needs two agents, one for swing-up and the other for balancing vertically, we decided to use an if statement in the reward function.
The task that the agent must perform consists of two phases. In the first one, it has to swing the second pendulum up to the vertical. In the second, while keeping the balance, it has to move the first pendulum to the target point. For this reason, the reward function has been split into two expressions. The first is a weighted sum of linear dependencies on the pendulum deflection angles. It is applied when the second pendulum is inclined from the vertical by an angle of more than 10°. The values of the parameters of this sum were selected to promote the swing of the second pendulum more than the alignment of the first one. The second formula applies when the angle of the pendulum to the vertical is less than 10°. In this phase, the agent must be concerned mainly with not losing its balance and moving the first pendulum closer to the target point. For this reason, this part of the function is a linear dependence on only the angle of the first pendulum, plus a penalty for loss of balance. (All angles in the equations are normalized.)
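A minimal sketch of such a two-phase ("if statement") reward, with illustrative weights and the normalisation done inside the function, could look like the following; it is not the tuned function from the repo:

```python
import numpy as np

def two_phase_reward(theta1, theta2, lost_balance):
    # theta1/theta2: deflection of the first/second pendulum from the vertical [rad].
    # lost_balance: flag set by the environment when balance is lost.
    # All weights are illustrative placeholders, not the tuned values.
    n1 = abs(theta1) / np.pi          # normalised deflection of the first pendulum
    n2 = abs(theta2) / np.pi          # normalised deflection of the second pendulum
    if abs(theta2) > np.deg2rad(10):  # phase 1: swing-up
        # promote swinging the second pendulum up more than aligning the first one
        return -(3.0 * n2 + 1.0 * n1)
    # phase 2: keep balance and move the first pendulum towards the target
    return -n1 - (10.0 if lost_balance else 0.0)
```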