
Machine Learning and Combinatorial Optimization

ML4CO

[1] Background on combinatorial optimization and branch-and-bound: pdf

[2] The classic paper on GNN-based branching: pdf

[3,4] Theoretical foundations for the expressive power and approximation ability of GNNs on linear / combinatorial optimization problems: pdf pdf

[5] The universal approximation theorem: pdf

Key points of [1]

1 Introduction

Branch-and-bound recursively partitions the solution space in the form of a search tree, computes the solution of a relaxation at each node, and prunes any subtree that cannot do better than the best solution found so far. This iterative process requires sequential decision making, such as node selection (choosing the next node whose subproblem to expand) and variable selection (choosing the variable on which to split the subtree). These decisions usually follow a set of hard-coded (1) heuristic rules designed by experts, whose goal is to minimize the average solving time over a representative set of MILP instances. In many settings, however, one needs to repeatedly solve a series of similar combinatorial optimization problems, which may differ significantly from the instance collections usually used to evaluate branch-and-bound algorithms (2). Using statistical learning methods is therefore attractive; doing so, however, faces several challenges (a toy sketch of the B&B loop itself follows the list below):

  • Encoding the state of the MILP B&B decision process is not easy, in particular because both the search tree and the integer linear programs can have variable structure and size;

  • It is also unclear how to design a model architecture that produces rules which generalize at least to similar instances, and ideally to larger instances unseen during training.
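For concreteness, here is a toy sketch of the B&B loop described above, using scipy.optimize.linprog for the LP relaxations. This is my own illustration of the generic algorithm, not the paper's implementation; production solvers such as SCIP are far more elaborate.

import numpy as np
from scipy.optimize import linprog

def branch_and_bound(c, A, b):
    """Toy B&B for min c@x s.t. A@x <= b, x binary. Returns (obj, x)."""
    best_obj, best_x = np.inf, None
    stack = [[(0, 1)] * len(c)]            # root node: every x_j in [0, 1]
    while stack:
        bounds = stack.pop()               # node selection (here: plain DFS)
        res = linprog(c, A_ub=A, b_ub=b, bounds=bounds)
        if not res.success or res.fun >= best_obj:
            continue                       # infeasible, or pruned by LP bound
        frac = [j for j, v in enumerate(res.x)
                if min(v - np.floor(v), np.ceil(v) - v) > 1e-6]
        if not frac:                       # integral solution: new incumbent
            best_obj, best_x = res.fun, res.x
            continue
        j = frac[0]                        # variable selection -- the decision
                                           # the paper proposes to learn
        lo, hi = bounds[j]
        for child_bound in ((lo, np.floor(res.x[j])), (np.ceil(res.x[j]), hi)):
            child = list(bounds)
            child[j] = child_bound
            stack.append(child)
    return best_obj, best_x

# Toy usage: min -x0 - x1  s.t.  x0 + 2*x1 <= 2, binary vars -> optimum -1.0
print(branch_and_bound(np.array([-1., -1.]), np.array([[1., 2.]]), np.array([2.])))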

The paper proposes a graph convolutional neural network (GCNN) to address these issues. It focuses on variable selection, i.e. the branching problem, which lies at the core of the B&B paradigm yet remains poorly understood theoretically. The paper adopts an imitation learning strategy to cheaply approximate strong branching, a high-quality but expensive branching rule.

While these ideas are not new in themselves, the paper makes two main contributions:

  • It encodes the branching policy into a GCNN, which exploits the natural bipartite-graph representation of MILP problems and thereby reduces manual feature engineering;
  • It approximates strong branching decisions by behavioral cloning with a cross-entropy loss, which is simpler than predicting strong branching scores or rankings.

The method is evaluated on four classes of NP-hard problems: set covering, combinatorial auctions, capacitated facility location, and maximum independent set. The results show that the chosen model, state encoding, and training procedure yield branching policies that improve substantially on conventional branching rules and generalize well to instances larger than those used in training.

1. I read this as: the rules are predefined and fixed in the algorithm, rather than generated or learned dynamically at run time;

2. I tentatively read this as a remark about poor generalization: strong branching (or any tuned rule) is only effective on a set of similar problems.

3 Background
3.2 Branching rules

To date, the branching strategy that consistently results in the smallest B&B trees is strong branching. It works by computing, at every possible branching point, the expected bound improvement of each candidate variable before actually branching, and then selecting the candidate with the best improvement. Unfortunately, strong branching requires solving two linear programs (LPs) for every candidate variable, so running it at every node is impractical. Modern B&B solvers instead rely on hybrid branching, which computes strong branching scores only at the beginning of the solving process and then gradually switches to simpler heuristics such as the conflict score, the pseudo-cost, or a handcrafted combination of the two. (A sketch of full strong branching at a single node follows below.)
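Here is a hedged sketch of full strong branching at one node, reusing the toy setup from the B&B sketch above. The product-based score with an epsilon floor is a common solver convention, not something this post specifies.

def strong_branching_choice(c, A, b, bounds, node_obj, x_lp, frac_vars, eps=1e-6):
    """Pick the fractional variable whose two child LPs improve the bound most."""
    def child_lp_bound(j, lo, hi):
        child = list(bounds)
        child[j] = (lo, hi)
        res = linprog(c, A_ub=A, b_ub=b, bounds=child)
        return res.fun if res.success else np.inf   # infeasible child
    best_j, best_score = None, -np.inf
    for j in frac_vars:                              # two LP solves per candidate...
        v = x_lp[j]
        down = child_lp_bound(j, bounds[j][0], np.floor(v))   # branch x_j <= floor(v)
        up   = child_lp_bound(j, np.ceil(v), bounds[j][1])    # branch x_j >= ceil(v)
        # ...combined into one score (product rule over bound improvements)
        score = max(down - node_obj, eps) * max(up - node_obj, eps)
        if score > best_score:
            best_j, best_score = j, score
    return best_j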

3.3 Markov decision process formulation

The sequential decisions made during B&B can be assimilated to a Markov decision process. Consider the solver as the environment and the brancher as the agent. At the $t$-th decision the solver is in a state $s_t$, which comprises the B&B tree with all past branching decisions, the best integer solution found so far, the LP solution of each node, the currently focused leaf node, and any other solver statistics (for example, how many times each primal heuristic has been called). The brancher then selects a variable $a_t$ among all fractional variables $\mathcal{A}(s_t)$ at the currently focused node, according to a policy $\pi(a_t \mid s_t)$. The solver then extends the B&B tree, solves the two child LP relaxations, runs any internal heuristics, prunes the tree if appropriate, and finally selects the next leaf node to split. We are then in a new state $s_{t+1}$, and the brancher is called again to make the next branching decision. This process is illustrated in the figure below (figure not reproduced here).

As a Markov decision process, B&B is episodic, where each episode corresponds to solving one MILP instance. The initial state corresponds to an instance sampled from a distribution of interest, while terminal states mark the end of the optimization process. The probability of a trajectory $\tau = (s_0, \dots, s_T)$ then depends on both the branching policy $\pi$ and the remaining components of the solver, and can be written as

$$p_\pi(\tau) = p(s_0) \prod_{t=0}^{T-1} \sum_{a \in \mathcal{A}(s_t)} \pi(a \mid s_t)\, p(s_{t+1} \mid s_t, a).$$

Note: the formula here in the original paper can be simplified; see here for the derivation.

My questions

  • The paper only learns the policy $\pi$, but I am not yet clear on how the next state is obtained concretely. In other words, once $a_t$ has been selected, how exactly is the problem split into subproblems? That seems to go beyond the scope of "branching".
4 Methodology

Since the B&B variable selection problem can be formulated as a Markov decision process, the natural way to train a policy would be reinforcement learning. This approach, however, runs into several problems. Notably, episode length scales with how poorly the policy performs (worse policies mean longer solving times), and a randomly initialized policy performs badly, so standard reinforcement learning algorithms are typically so slow at the start of training that the total training time becomes prohibitive. Moreover, once the initial state corresponding to an instance has been sampled, the remainder of the process is instance-specific, so the Markov decision process tends to be extremely large. In this work, the authors instead choose to learn directly from an expert branching rule, an approach usually referred to as imitation learning.

4.1 Imitation learning

First, the expert (strong branching) is run to collect a dataset of expert state-action pairs; the policy is then trained by behavioral cloning with a cross-entropy loss.
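Concretely, with a dataset $\mathcal{D} = \{(s, a^*)\}$ of $N$ solver states and the expert's strong-branching choices, the policy $\pi_\theta$ is trained by minimizing the cross-entropy loss (this is the standard behavioral-cloning objective; the notation here is mine):

$$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{(s,\, a^*) \in \mathcal{D}} \log \pi_\theta(a^* \mid s)$$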

4.2 State encoding

Encoding of the state $s_t$. The state is represented as a bipartite graph with features: constraint nodes on one side, variable nodes on the other, and an edge $(i, j)$ between constraint $i$ and variable $j$ whenever variable $j$ appears in constraint $i$ with a nonzero coefficient. The meaning of each feature is listed below:

Tensor | Feature       | Description
------ | ------------- | ------------------------------------------------------------------------
C      | obj_cos_sim   | Cosine similarity with objective.
C      | bias          | Bias value, normalized with constraint coefficients.
C      | is_tight      | Tightness indicator in LP solution.
C      | dualsol_val   | Dual solution value, normalized.
C      | age           | LP age, normalized with total number of LPs.
E      | coef          | Constraint coefficient, normalized per constraint.
V      | type          | Type (binary, integer, impl. integer, continuous) as a one-hot encoding.
V      | coef          | Objective coefficient, normalized.
V      | has_lb        | Lower bound indicator.
V      | has_ub        | Upper bound indicator.
V      | sol_is_at_lb  | Solution value equals lower bound.
V      | sol_is_at_ub  | Solution value equals upper bound.
V      | sol_frac      | Solution value fractionality.
V      | basis_status  | Simplex basis status (lower, basic, upper, zero) as a one-hot encoding.
V      | reduced_cost  | Reduced cost, normalized.
V      | age           | LP age, normalized.
V      | sol_val       | Solution value.
V      | inc_val       | Value in incumbent.
V      | avg_inc_val   | Average value in incumbents.

(C = constraint-node features, E = edge features, V = variable-node features.)
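To make the encoding concrete, here is a minimal numpy sketch of turning an MILP $\min c^\top x$ s.t. $Ax \le b$ into the three feature tensors. The features below are simplified placeholders of my own, not the full list above.

import numpy as np

def encode_bipartite(c, A, b, x_lp):
    """Bipartite encoding: constraint nodes (C), variable nodes (V), edges (E)."""
    rows, cols = np.nonzero(A)      # edge (i, j) iff variable j is in constraint i
    row_norms = np.linalg.norm(A, axis=1)
    # E: one feature per edge -- the coefficient, normalized per constraint.
    E = (A[rows, cols] / row_norms[rows])[:, None]
    # C: per-constraint features, e.g. normalized bias and cosine similarity with c.
    C = np.stack([b / row_norms,
                  (A @ c) / (row_norms * np.linalg.norm(c))], axis=1)
    # V: per-variable features, e.g. objective coefficient and LP fractionality.
    V = np.stack([c / np.linalg.norm(c),
                  np.abs(x_lp - np.round(x_lp))], axis=1)
    return C, E, V, (rows, cols)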
4.3 Parameterizing the branching rule

The network architecture is as follows (the figure is not reproduced here): the policy is a graph convolutional neural network that performs two interleaved half-convolutions on the bipartite state graph, one pass from variables to constraints and one pass back from constraints to variables, followed by a final multilayer perceptron on the variable nodes and a masked softmax over the candidate branching variables.
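A hedged numpy sketch of that data flow, with plain weight matrices standing in for the learned MLPs, so it shows only the message-passing structure rather than the paper's exact architecture:

import numpy as np

def gcnn_policy(C, E_adj, V, W_c, W_v, w_out, candidates):
    """One V->C and one C->V half-convolution, then a masked softmax.

    C: (m, dc) constraint features, V: (n, dv) variable features,
    E_adj: (m, n) dense matrix of (normalized) constraint coefficients.
    """
    # Half-convolution 1: each constraint aggregates its variables' features.
    C2 = np.tanh(C @ W_c[0] + (E_adj @ V) @ W_c[1])
    # Half-convolution 2: each variable aggregates its constraints' features.
    V2 = np.tanh(V @ W_v[0] + (E_adj.T @ C2) @ W_v[1])
    # Score every variable, mask out non-candidates, softmax -> pi(a | s_t).
    scores = V2 @ w_out
    masked = np.full_like(scores, -np.inf)
    masked[candidates] = scores[candidates]
    z = np.exp(masked - masked[candidates].max())
    return z / z.sum()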


Logistic Regression

In this ungraded lab, you will:

  • explore the sigmoid function (also known as the logistic function)
  • explore logistic regression, which uses the sigmoid function

import numpy as np
%matplotlib widget
import matplotlib.pyplot as plt
from plt_one_addpt_onclick import plt_one_addpt_onclick
from lab_utils_common import draw_vthresh
plt.style.use('./deeplearning.mplstyle')

Sigmoid or Logistic Function

As discussed in the lecture videos, for a classification task, we can start by using our linear regression model, $f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b$, to predict $y$ given $x$.

  • However, we would like the predictions of our classification model to be between 0 and 1 since our output variable is either 0 or 1.
  • This can be accomplished by using a "sigmoid function" which maps all input values to values between 0 and 1.

Let's implement the sigmoid function and see this for ourselves.

Formula for Sigmoid function

The formula for a sigmoid function is as follows -

$$g(z) = \frac{1}{1 + e^{-z}}$$

In the case of logistic regression, $z$ (the input to the sigmoid function) is the output of a linear regression model.

  • In the case of a single example, $z$ is a scalar.
  • In the case of multiple examples, $z$ may be a vector consisting of $m$ values, one for each example.
  • The implementation of the sigmoid function should cover both of these potential input formats. Let's implement this in Python.

NumPy has a function called exp(), which offers a convenient way to calculate the exponential ($e^{z}$) of all elements in the input array (z).

It also works with a single number as an input, as shown below.

# Input is an array. 
input_array = np.array([1,2,3])
exp_array = np.exp(input_array)

print("Input to exp:", input_array)
print("Output of exp:", exp_array)

# Input is a single number
input_val = 1  
exp_val = np.exp(input_val)

print("Input to exp:", input_val)
print("Output of exp:", exp_val)
Input to exp: [1 2 3]
Output of exp: [ 2.72  7.39 20.09]
Input to exp: 1
Output of exp: 2.718281828459045

The sigmoid function is implemented in Python as shown in the cell below.

def sigmoid(z):
    """
    Compute the sigmoid of z

    Args:
        z (ndarray): A scalar, numpy array of any size.

    Returns:
        g (ndarray): sigmoid(z), with the same shape as z
         
    """

    g = 1/(1+np.exp(-z))
   
    return g

Let's see what the output of this function is for various values of $z$.

# Generate an array of evenly spaced values between -10 and 10
z_tmp = np.arange(-10,11)

# Use the function implemented above to get the sigmoid values
y = sigmoid(z_tmp)

# Code for pretty printing the two arrays next to each other
np.set_printoptions(precision=3) 
print("Input (z), Output (sigmoid(z))")
print(np.c_[z_tmp, y])
Input (z), Output (sigmoid(z))
[[-1.000e+01  4.540e-05]
 [-9.000e+00  1.234e-04]
 [-8.000e+00  3.354e-04]
 [-7.000e+00  9.111e-04]
 [-6.000e+00  2.473e-03]
 [-5.000e+00  6.693e-03]
 [-4.000e+00  1.799e-02]
 [-3.000e+00  4.743e-02]
 [-2.000e+00  1.192e-01]
 [-1.000e+00  2.689e-01]
 [ 0.000e+00  5.000e-01]
 [ 1.000e+00  7.311e-01]
 [ 2.000e+00  8.808e-01]
 [ 3.000e+00  9.526e-01]
 [ 4.000e+00  9.820e-01]
 [ 5.000e+00  9.933e-01]
 [ 6.000e+00  9.975e-01]
 [ 7.000e+00  9.991e-01]
 [ 8.000e+00  9.997e-01]
 [ 9.000e+00  9.999e-01]
 [ 1.000e+01  1.000e+00]]

The values in the left column are z, and the values in the right column are sigmoid(z). As you can see, the input values to the sigmoid range from -10 to 10, and the output values range from 0 to 1.

Now, let's try to plot this function using the matplotlib library.

# Plot z vs sigmoid(z)
fig,ax = plt.subplots(1,1,figsize=(5,3))
ax.plot(z_tmp, y, c="b")

ax.set_title("Sigmoid function")
ax.set_ylabel('sigmoid(z)')
ax.set_xlabel('z')
draw_vthresh(ax,0)
[Interactive plot: sigmoid(z) versus z, with a vertical threshold marked at z = 0]

As you can see, the sigmoid function approaches 0 as z goes to large negative values and approaches 1 as z goes to large positive values.

Logistic Regression

A logistic regression model applies the sigmoid to the familiar linear regression model as shown below:

$$f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = g(\mathbf{w} \cdot \mathbf{x}^{(i)} + b)$$

where

$$g(z) = \frac{1}{1 + e^{-z}}$$
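Combining the two pieces, a hypothetical helper (not one of this lab's utilities) that produces probabilities for a batch of examples would be:

def predict_logistic(X, w, b):
    # Linear model first, then squash through the sigmoid: g(w.x + b)
    return sigmoid(np.dot(X, w) + b)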

Let's apply logistic regression to the categorical data example of tumor classification.
First, load the examples and initial values for the parameters.

x_train = np.array([0., 1, 2, 3, 4, 5])
y_train = np.array([0,  0, 0, 1, 1, 1])

w_in = np.zeros((1))
b_in = 0

Try the following steps:

  • Click on 'Run Logistic Regression' to find the best logistic regression model for the given training data.
    • Note the resulting model fits the data quite well.
    • Note the orange line is '$z$', i.e. $\mathbf{w} \cdot \mathbf{x}^{(i)} + b$ above. It does not match the line in a linear regression model.
  • Further improve these results by applying a threshold: tick the 'Toggle 0.5 threshold' box to show the predictions if a threshold is applied.
    • These predictions look good; the predictions match the data.
  • Now, add further data points in the large tumor size range (near 10), and re-run logistic regression.
    • Unlike the linear regression model, this model continues to make correct predictions.

plt.close('all') 
addpt = plt_one_addpt_onclick( x_train,y_train, w_in, b_in, logistic=True)
[Interactive plot: logistic regression fit on the tumor-classification data]
Linear Regression and Gradient Descent

Linear regression

  • The model function for linear regression, which is a function that maps from $x$ to $y$, is represented as $f_{w,b}(x) = wx + b$.

  • To train a linear regression model, you want to find the best $(w,b)$ parameters that fit your dataset.

    • To compare how one choice of $(w,b)$ is better or worse than another choice, you can evaluate it with a cost function $J(w,b)$.
      • $J$ is a function of $(w,b)$. That is, the value of the cost $J(w,b)$ depends on the value of $(w,b)$.
    • The choice of $(w,b)$ that fits your data the best is the one that has the smallest cost $J(w,b)$.
  • To find the values $(w,b)$ that get the smallest possible cost $J(w,b)$, you can use a method called gradient descent.

    • With each step of gradient descent, your parameters $(w,b)$ come closer to the optimal values that will achieve the lowest cost $J(w,b)$.
  • The trained linear regression model can then take the input feature $x$ and output a prediction $f_{w,b}(x)$.

Gradient Descent

Gradient descent involves repeated steps to adjust the values of your parameters $(w,b)$ to gradually get a smaller and smaller cost $J(w,b)$.

  • At each step of gradient descent, it will be helpful for you to monitor your progress by computing the cost $J(w,b)$ as $(w,b)$ gets updated.
  • In this section, you will implement a function to calculate $J(w,b)$ so that you can check the progress of your gradient descent implementation.
Cost function

As you may recall from the lecture, for one variable, the cost function for linear regression $J(w,b)$ is defined as

$$J(w,b) = \frac{1}{2m} \sum_{i=0}^{m-1} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2$$

  • $f_{w,b}(x^{(i)})$ is the model's prediction through the mapping $f_{w,b}$, as opposed to $y^{(i)}$, which is the actual value from the dataset.
  • $m$ is the number of training examples in the dataset.
Model prediction
  • For linear regression with one variable, the prediction of the model for an example $x^{(i)}$ is represented as $f_{w,b}(x^{(i)}) = wx^{(i)} + b$.

This is the equation for a line, with an intercept $b$ and a slope $w$.
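This cost translates directly into code; a minimal sketch for the single-feature case (the course's own helper may differ in interface):

def compute_cost(x, y, w, b):
    """J(w,b) = (1 / 2m) * sum_i (w * x[i] + b - y[i])**2"""
    m = x.shape[0]
    f_wb = w * x + b                 # model predictions for all m examples
    return np.sum((f_wb - y) ** 2) / (2 * m)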

Algorithm

The gradient descent algorithm is:

$$\text{repeat until convergence:} \; \lbrace$$
$$\qquad w = w - \alpha \frac{\partial J(w,b)}{\partial w}$$
$$\qquad b = b - \alpha \frac{\partial J(w,b)}{\partial b}$$
$$\rbrace$$

where parameters $w$ and $b$ are both updated simultaneously, and where

$$\frac{\partial J(w,b)}{\partial w} = \frac{1}{m} \sum_{i=0}^{m-1} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x^{(i)}$$

$$\frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum_{i=0}^{m-1} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)$$

  • $m$ is the number of training examples in the dataset.
  • $f_{w,b}(x^{(i)})$ is the model's prediction, while $y^{(i)}$ is the target value.
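A minimal sketch of these updates for the single-feature case (my illustration; note both gradients are computed before either parameter changes, which is what "simultaneously" means here):

def compute_gradient(x, y, w, b):
    """Returns dJ/dw and dJ/db, averaged over the m training examples."""
    m = x.shape[0]
    err = (w * x + b) - y            # prediction error for each example
    return np.dot(err, x) / m, np.sum(err) / m

def gradient_descent(x, y, w, b, alpha, num_iters):
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(x, y, w, b)
        w = w - alpha * dj_dw        # simultaneous update: both gradients were
        b = b - alpha * dj_db        # computed with the old (w, b)
    return w, b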