<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Blog | Yuchao's Personal Website</title><link>https://yuchaosu.com/post/</link><atom:link href="https://yuchaosu.com/post/index.xml" rel="self" type="application/rss+xml"/><description>Blog</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><image><url>https://yuchaosu.com/media/icon_hu_982c5d63a71b2961.png</url><title>Blog</title><link>https://yuchaosu.com/post/</link></image><item><title>机器学习(一)（Machine Learning I）</title><link>https://yuchaosu.com/post/mli/</link><pubDate>Sun, 05 Oct 2025 00:00:00 +0000</pubDate><guid>https://yuchaosu.com/post/mli/</guid><description>&lt;h2 id="overview-of-machine-learning">Overview of Machine Learning&lt;/h2>
&lt;p>The beginning of machine learning can be traced back to the 1950s when Arthur Samuel, a pioneer in the field of artificial intelligence, developed a program that could play checkers and improve its performance over time through experience. Since then, machine learning has evolved significantly, with advancements in algorithms, computing power, and the availability of large datasets.&lt;/p>
&lt;p>Machine learning is mainly concerned with how well a computer system performs a specific task, and with optimizing algorithms to improve that performance over time. Machine learning systems use vast amounts of data to learn and make predictions or decisions without being explicitly programmed for every possible scenario.
&lt;figure id="figure-machine-learning">
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="Machine Learning"
srcset="https://yuchaosu.com/post/mli/overview_hu_bc4fc7173d3dadfc.webp 320w, https://yuchaosu.com/post/mli/overview_hu_5b8e9dfb34f764f7.webp 480w, https://yuchaosu.com/post/mli/overview_hu_d409860e83ec619b.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://yuchaosu.com/post/mli/overview_hu_bc4fc7173d3dadfc.webp"
width="760"
height="380"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Machine Learning
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;h3 id="difference-between-ai-machine-learning-and-deep-learning">Difference between AI, Machine Learning, and Deep Learning&lt;/h3>
&lt;p>
&lt;figure id="figure-ai-vs-machine-learning-vs-deep-learning">
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="AI vs Machine Learning vs Deep Learning"
srcset="https://yuchaosu.com/post/mli/compare_hu_ffa7f6cc8ad3046b.webp 320w, https://yuchaosu.com/post/mli/compare_hu_78bcdf7cf6406a1f.webp 480w, https://yuchaosu.com/post/mli/compare_hu_4828451a02fb3c80.webp 696w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://yuchaosu.com/post/mli/compare_hu_ffa7f6cc8ad3046b.webp"
width="696"
height="522"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
AI vs Machine Learning vs Deep Learning
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Artificial Intelligence (AI): AI is a broad field of computer science that focuses on creating systems capable of performing tasks that typically require human intelligence. This includes problem-solving, reasoning, learning, and understanding natural language. AI encompasses various techniques, including rule-based systems, expert systems, and machine learning.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Machine Learning (ML): ML is a subset of AI that focuses on developing algorithms and statistical models that enable computers to learn from and make predictions or decisions based on data. Instead of being explicitly programmed for every task, ML systems improve their performance over time as they are exposed to more data. Common ML techniques include supervised learning, unsupervised learning, and reinforcement learning.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Deep Learning (DL): DL is a specialized subset of ML that uses neural networks with many layers (deep networks) to analyze various factors of data. DL has gained popularity due to its success in tasks such as image and speech recognition, natural language processing, and game playing. It requires large amounts of data and significant computational power to train deep neural networks effectively.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="history-of-machine-learning">History of Machine Learning&lt;/h3>
&lt;ul>
&lt;li>1950-1970: &lt;strong>Alan Turing&lt;/strong> proposed the concept of a machine that could simulate human intelligence. &lt;strong>Frank Rosenblatt&lt;/strong> developed the Perceptron, an early neural network model. &lt;strong>Arthur Samuel&lt;/strong> developed a checkers-playing program that improved through experience.&lt;/li>
&lt;li>1970-1980: &lt;strong>Knowledge-based systems and expert systems gained popularity.&lt;/strong> The backpropagation algorithm for training neural networks was introduced.
Expert systems are AI programs that mimic the decision-making abilities of a human expert in a specific domain. They use a set of rules and knowledge bases to provide solutions or recommendations based on input data, a knowledge-driven approach typically represented as &amp;ldquo;if-then&amp;rdquo; rules. &lt;strong>Decision trees&lt;/strong> were introduced as a simple yet effective machine learning algorithm for classification and regression tasks. In addition, &lt;strong>Bayes&amp;rsquo; theorem&lt;/strong> began to be used in machine learning for probabilistic inference and decision-making.&lt;/li>
&lt;li>1980-2000: &lt;strong>Data-driven and statistical methods became more prominent.&lt;/strong> Decision trees evolved into more advanced algorithms like C4.5. &lt;strong>Support Vector Machines (SVMs)&lt;/strong> were introduced for classification tasks. &lt;strong>Unsupervised learning techniques like clustering and K-means&lt;/strong> gained attention. Random forests and ensemble methods were developed to improve predictive performance.&lt;/li>
&lt;li>2000-2020: &lt;strong>As computing power increased and big data became more available, machine learning experienced significant growth.&lt;/strong> Machine learning broke through the performance bottlenecks of image, voice, and text tasks as deep learning was introduced. &lt;strong>AlexNet&lt;/strong>, a deep CNN, won the ImageNet competition in 2012, marking a significant milestone in deep learning. &lt;strong>Generative Adversarial Networks (GANs)&lt;/strong> were introduced for generating realistic data. &lt;strong>AIGC&lt;/strong> (Artificial Intelligence Generated Content) became a hot topic. &lt;strong>Reinforcement learning&lt;/strong> gained attention with successes like AlphaGo defeating human champions in the game of Go. &lt;strong>Transformers&lt;/strong> were introduced in 2017, revolutionizing natural language processing tasks.&lt;/li>
&lt;li>2020-Present: Large language models (LLMs) like GPT-3, GPT-4, and BERT have demonstrated impressive capabilities in natural language understanding and generation. Multi-modal techniques have emerged, allowing models to process and generate content across different modalities (e.g., text, image, audio); CLIP from OpenAI and Gato from Google DeepMind have shown the potential of combining vision and language. More general machine learning techniques have also emerged. &lt;strong>Self-supervised learning&lt;/strong> has gained traction, allowing models to learn from unlabeled data and significantly reducing the labor of labeling. &lt;strong>Federated learning&lt;/strong> has emerged as a way to train models across decentralized devices while preserving data privacy. &lt;strong>Explainable AI (XAI)&lt;/strong> has become a focus to enhance the transparency and interpretability of machine learning models.&lt;/li>
&lt;/ul>
&lt;h3 id="machine-learning-applications">Machine Learning Applications&lt;/h3>
&lt;p>Machine learning has a wide range of applications across various industries and domains. Some common applications include:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Image and Speech Recognition&lt;/strong>: Machine learning algorithms are used in facial recognition systems, voice assistants (e.g., Siri, Alexa), and image classification tasks.&lt;/li>
&lt;li>&lt;strong>Natural Language Processing (NLP)&lt;/strong>: Machine learning is used in language translation, sentiment analysis, chatbots, and text generation.&lt;/li>
&lt;li>&lt;strong>Recommendation Systems&lt;/strong>: Platforms like Netflix, Amazon, and Spotify use machine learning to recommend products, movies, or music based on user preferences and behavior.&lt;/li>
&lt;li>&lt;strong>Healthcare&lt;/strong>: Machine learning is used for medical image analysis, disease diagnosis, drug discovery, and personalized treatment recommendations. It is also used in protein structure prediction and design, such as AlphaFold from DeepMind.&lt;/li>
&lt;li>&lt;strong>Finance&lt;/strong>: Machine learning is used for fraud detection, algorithmic trading, credit scoring, and risk assessment.&lt;/li>
&lt;/ul>
&lt;h3 id="machine-learning-terminology">Machine Learning Terminology&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Dataset&lt;/strong>: A collection of data used for training and evaluating machine learning models. It typically consists of input features and corresponding labels or target values.
&lt;ul>
&lt;li>&lt;strong>Training Set&lt;/strong>: A subset of the dataset used to train a machine learning model. The model learns patterns and relationships from this data.&lt;/li>
&lt;li>&lt;strong>Validation Set&lt;/strong>: A subset of the dataset used to tune hyperparameters and evaluate the model&amp;rsquo;s performance during training.&lt;/li>
&lt;li>&lt;strong>Test Set&lt;/strong>: A subset of the dataset used to assess the final performance of a trained machine learning model on unseen data.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Model&lt;/strong>: A mathematical representation of a system or process that is trained to make predictions or decisions based on input data.&lt;/li>
&lt;li>&lt;strong>Sample&lt;/strong>: An individual data point or instance in a dataset, consisting of input features and corresponding labels or target values.&lt;/li>
&lt;li>&lt;strong>Feature&lt;/strong>: An individual measurable property or characteristic of the data used as input to a machine learning model.&lt;/li>
&lt;li>&lt;strong>Feature Vector&lt;/strong>: A vector that represents the features of a sample in a dataset. It is typically a numerical representation of the input data.&lt;/li>
&lt;li>&lt;strong>Label&lt;/strong>: The target variable or output that a machine learning model aims to predict or classify.&lt;/li>
&lt;li>&lt;strong>Training&lt;/strong>: The process of feeding a machine learning model with labeled data to learn patterns and relationships.&lt;/li>
&lt;li>&lt;strong>Testing&lt;/strong>: The process of evaluating a trained machine learning model on unseen data to assess its performance.&lt;/li>
&lt;li>&lt;strong>Overfitting&lt;/strong>: A situation where a machine learning model performs well on the training data but poorly on unseen data, indicating that it has learned noise or irrelevant patterns.&lt;/li>
&lt;li>&lt;strong>Underfitting&lt;/strong>: A situation where a machine learning model fails to capture the underlying patterns in the training data, resulting in poor performance on both training and unseen data.&lt;/li>
&lt;li>&lt;strong>Cross-validation&lt;/strong>: A technique used to assess the performance of a machine learning model by dividing the dataset into multiple subsets and training/testing the model on different combinations of these subsets.&lt;/li>
&lt;li>&lt;strong>Hyperparameters&lt;/strong>: Parameters that are set before training a machine learning model and control its learning process (e.g., learning rate, number of layers).&lt;/li>
&lt;li>&lt;strong>Parameters&lt;/strong>: Internal variables of a machine learning model that are learned during the training process (e.g., weights in a neural network).&lt;/li>
&lt;li>&lt;strong>Loss Function&lt;/strong>: A mathematical function that quantifies the difference between the predicted output of a machine learning model and the actual target values. The goal of training is to minimize the loss function.&lt;/li>
&lt;li>&lt;strong>Gradient Descent&lt;/strong>: An optimization algorithm used to minimize the loss function by iteratively adjusting the model&amp;rsquo;s parameters in the direction of the steepest descent of the loss function.&lt;/li>
&lt;li>&lt;strong>Epoch&lt;/strong>: A complete pass through the entire training dataset during the training process.&lt;/li>
&lt;li>&lt;strong>Batch Size&lt;/strong>: The number of training examples used in one iteration of training. It determines how many samples are processed before the model&amp;rsquo;s parameters are updated.&lt;/li>
&lt;li>&lt;strong>Learning Rate&lt;/strong>: A hyperparameter that controls the step size at which the model&amp;rsquo;s parameters are updated during training. Too high a learning rate can lead to divergence, while too low a learning rate can result in slow convergence.&lt;/li>
&lt;li>&lt;strong>Activation Function&lt;/strong>: A mathematical function applied to the output of a neuron in a neural network to introduce non-linearity. Common activation functions include ReLU, sigmoid, and tanh.&lt;/li>
&lt;li>&lt;strong>Quantization&lt;/strong>: The process of reducing the precision of the numbers used to represent a model&amp;rsquo;s parameters, typically to decrease the model size and improve inference speed. Common quantization techniques include 8-bit integer quantization and 16-bit floating-point quantization.&lt;/li>
&lt;/ul>
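&lt;p>Several of the terms above, such as loss function, gradient descent, learning rate, epoch, and parameters, can be seen working together in one place. Below is a minimal sketch in plain Python; the toy data and the names (&lt;code>w&lt;/code>, &lt;code>b&lt;/code>, &lt;code>lr&lt;/code>) are illustrative assumptions, not taken from any library:&lt;/p>

```python
# Minimal gradient descent sketch: fit y = 2x + 1 by minimizing MSE loss.
# All names and values here are illustrative.

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]               # labels for each sample

w, b = 0.0, 0.0                            # parameters, learned during training
lr = 0.05                                  # learning rate (a hyperparameter)

for epoch in range(500):                   # one epoch = a full pass over the data
    # Gradient of the mean squared error loss with respect to w and b
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w                       # step against the gradient
    b -= lr * grad_b

print(round(w, 2), round(b, 2))            # converges toward 2.0 and 1.0
```

&lt;p>Raising the learning rate too far makes these updates diverge, while a tiny one needs far more epochs, matching the trade-off described in the list above.&lt;/p>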
&lt;!-- ### Some Current Machine Learning Experts
- **Yann LeCun** Father of Convolutional Neural Networks (CNNs) and Deep Learning
- **Andrew Ng** Co-founder of Google Brain and Coursera
- **Geoffrey Hinton** Pioneer of Deep Learning and Neural Networks
- **Fei-Fei Li** Expert in Computer Vision and AI Ethics. Raise the ImageNet project
- **Ian Goodfellow** Inventor of Generative Adversarial Networks (GANs)
- **Pedro Domingos** Author of "The Master Algorithm"
- **Kaiming He** Co-inventor of ResNet -->
&lt;h2 id="basic-rules-of-machine-learning">Basic Rules of Machine Learning&lt;/h2>
&lt;h3 id="basic-components-of-machine-learning">Basic Components of Machine Learning&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">Machine Learning = Model + Strategy + Algorithm
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>&lt;strong>Model&lt;/strong>: A mathematical representation of a system or process that is trained to make predictions or decisions based on input data.&lt;/li>
&lt;li>&lt;strong>Strategy&lt;/strong>: The approach or method used to train and evaluate a machine learning model, such as supervised learning, unsupervised learning, or reinforcement learning.&lt;/li>
&lt;li>&lt;strong>Algorithm&lt;/strong>: A set of rules or procedures used to solve a specific problem or perform a specific task, such as linear regression, decision trees, or neural networks.
Data also matters greatly: it is not part of machine learning itself but rather its input, and its quality and quantity can significantly impact a model&amp;rsquo;s performance.&lt;/li>
&lt;/ul>
&lt;h3 id="classification-of-machine-learning">Classification of Machine Learning&lt;/h3>
&lt;p>Basically, machine learning can be classified into the following main categories:
&lt;figure id="figure-types-of-machine-learning">
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="Types of Machine Learning"
srcset="https://yuchaosu.com/post/mli/supervsunsupervsrein_hu_51260bb8b3e9b0c5.webp 320w, https://yuchaosu.com/post/mli/supervsunsupervsrein_hu_ffc1976d3d5c52b4.webp 480w, https://yuchaosu.com/post/mli/supervsunsupervsrein_hu_b89ab5c7557fd566.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://yuchaosu.com/post/mli/supervsunsupervsrein_hu_51260bb8b3e9b0c5.webp"
width="760"
height="496"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Types of Machine Learning
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Supervised Learning&lt;/strong>: The model is trained on &lt;strong>labeled data&lt;/strong>, where the input features and corresponding labels are provided. The goal is to learn a &lt;strong>mapping&lt;/strong> from inputs to outputs. Examples include regression and classification tasks. Similarity learning can be considered a combination of regression and classification: its target is to learn the similarity between two inputs, using a similarity function to measure the distance (similarity, relevance) between them. It is widely used in recommendation systems, facial recognition, and information retrieval.&lt;/li>
&lt;li>&lt;strong>Unsupervised Learning&lt;/strong>: The model is trained on &lt;strong>unlabeled data&lt;/strong>, where only the input features are provided. The goal is to &lt;strong>discover patterns or structures&lt;/strong> in the data. Examples include clustering and dimensionality reduction.&lt;/li>
&lt;li>&lt;strong>Semi-supervised Learning&lt;/strong>: The model is trained on a combination of &lt;strong>labeled&lt;/strong> and &lt;strong>unlabeled data&lt;/strong>. This approach can be useful when labeled data is scarce or expensive to obtain.&lt;/li>
&lt;li>&lt;strong>Reinforcement Learning&lt;/strong>: The model learns to make decisions by interacting with &lt;strong>an environment and receiving feedback in the form of rewards or penalties.&lt;/strong> The goal is to learn a policy that maximizes the cumulative reward over time.&lt;/li>
&lt;/ul>
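&lt;p>For the similarity learning mentioned above, a common choice of similarity function is cosine similarity between two feature vectors. A minimal pure-Python sketch (the example vectors are made up):&lt;/p>

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors:
    ~1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors pointing the same way score ~1.0; orthogonal ones score 0.0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

&lt;p>A recommendation system might compare a user&amp;rsquo;s preference vector against item vectors this way and rank the items by the resulting score.&lt;/p>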
&lt;p>We can also classify machine learning based on the type of model used, such as:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Linear Models&lt;/strong>: These models make predictions based on a linear combination of input features. Examples include linear regression and logistic regression.&lt;/li>
&lt;li>&lt;strong>Probabilistic Models&lt;/strong>: These models use probability theory to make predictions and quantify uncertainty. Examples include Bayesian networks and hidden Markov models.&lt;/li>
&lt;/ul>
&lt;h3 id="modeling-process">Modeling Process&lt;/h3>
&lt;p>Using supervised learning as an example, the modeling process typically involves the following steps:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Data Collection&lt;/strong>: Gather a dataset that includes input features and corresponding labels. This dataset will be used for training and evaluating the model.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Data Preprocessing&lt;/strong>: Clean and preprocess the data to ensure its quality. This may involve handling missing values, normalizing or scaling features, and encoding categorical variables.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Feature Selection/Engineering&lt;/strong>: Identify the most relevant features for the task at hand. This may involve creating new features from existing ones or selecting a subset of features to improve model performance.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Model Selection&lt;/strong>: Choose an appropriate machine learning algorithm or model architecture based on the problem type and data characteristics. This could involve trying out different algorithms and comparing their performance.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Training&lt;/strong>: Train the selected model on the training dataset by feeding it the input features and corresponding labels. The model learns to map inputs to outputs during this phase.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Evaluation&lt;/strong>: Assess the model&amp;rsquo;s performance on a separate validation or test dataset. Common evaluation metrics include accuracy, precision, recall, F1-score, and area under the ROC curve (AUC-ROC).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Hyperparameter Tuning&lt;/strong>: Optimize the model&amp;rsquo;s hyperparameters to improve performance. This may involve techniques like grid search or random search to find the best combination of hyperparameters.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Deployment&lt;/strong>: Once the model is trained and evaluated, it can be deployed to a production environment where it can make predictions on new, unseen data.&lt;/p>
&lt;/li>
&lt;/ol>
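&lt;p>The steps above can be sketched end to end on a toy problem. Everything here, the synthetic &amp;ldquo;height&amp;rdquo; data and the nearest-class-mean model, is an illustrative assumption rather than a recommended recipe:&lt;/p>

```python
import random

random.seed(0)

# 1. Data collection: one feature (height in cm), label 0 (short) / 1 (tall).
data = [((random.gauss(160, 5),), 0) for _ in range(50)] + \
       [((random.gauss(180, 5),), 1) for _ in range(50)]
random.shuffle(data)

# 2. Preprocessing: min-max scale the feature to [0, 1].
lo = min(x for (x,), _ in data)
hi = max(x for (x,), _ in data)
data = [(((x - lo) / (hi - lo),), y) for (x,), y in data]

# Hold out a test set so evaluation (step 6) uses unseen data.
train, test = data[:80], data[80:]

# 4/5. Model selection and training: a nearest-class-mean classifier.
mean = {c: sum(x for (x,), y in train if y == c) /
           sum(1 for _, y in train if y == c)
        for c in (0, 1)}

def predict(x):
    return min((0, 1), key=lambda c: abs(x - mean[c]))

# 6. Evaluation: accuracy on the held-out test set.
accuracy = sum(predict(x) == y for (x,), y in test) / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

&lt;p>Steps 3 (feature engineering) and 7 (hyperparameter tuning) are trivial here because there is a single feature and no hyperparameter, but they slot into the same loop in real projects.&lt;/p>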
&lt;p>&lt;em>From my point of view, we are trying to model the world by abstracting it into mathematics: vectors represent the real world, and functions represent the relationships between those vectors. Based on these vectors and functions, a model can infer the rules of the world, and those rules let us predict the future or generate new things. Previously, we had to define the rules of the world explicitly; now, with machine learning, the computer can learn them from data, which is a big step forward. After decades of development, however, machine learning models describe the world in a black-box way: we can use a model to predict or generate, but we don&amp;rsquo;t know how it works inside. This is a major challenge, and finding ways to understand models better and make them more explainable (interpretable machine learning) is a key future direction. As model parameters grow exponentially, from the billion to the trillion level, interpretability becomes ever more challenging and important.&lt;/em>&lt;/p></description></item><item><title>Tenstorrent HPC快速上手指南（The Quick Start Guide to Tenstorrent HPC）</title><link>https://yuchaosu.com/post/tenstorrent/</link><pubDate>Thu, 02 Oct 2025 00:00:00 +0000</pubDate><guid>https://yuchaosu.com/post/tenstorrent/</guid><description>&lt;p>&lt;strong>Official Documentation:&lt;/strong>
&lt;/p>
&lt;!-- [Tenstorrent Metalium](https://docs.tenstorrent.com/tt-metal/latest/tt-metalium/index.html) -->
&lt;!-- [Tenstorrent Metalium Examples](https://docs.tenstorrent.com/tt-metal/latest/tt-metalium/tt_metal/examples/index.html) -->
&lt;h2 id="introduction-to-tenstorrent">Introduction to Tenstorrent&lt;/h2>
&lt;p>Tenstorrent is a Canadian AI hardware and software company that designs and manufactures high-performance processors for machine learning and artificial intelligence applications. Founded in 2016 by Ljubomir Bajic, Milos Trajkovic, and Ivan Hamer, Tenstorrent aims to provide cutting-edge solutions for AI workloads, focusing on efficiency, scalability, and performance. Tenstorrent provides a software stack with multiple levels, including:&lt;/p>
&lt;ul>
&lt;li>
&lt;strong>TT-Metalium&lt;/strong>: The low-level, open-source software development kit (SDK) that provides developers direct access to Tenstorrent hardware. It is a bare-metal programming environment designed for users who need to write custom C++ kernels for machine learning or other high-performance computing workloads.&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure id="figure-tenstorrent-software-stack">
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="Tenstorrent Software Stack"
srcset="https://yuchaosu.com/post/tenstorrent/Software_hu_978631f73d4d279.webp 320w, https://yuchaosu.com/post/tenstorrent/Software_hu_56ac139426d8d5a5.webp 480w, https://yuchaosu.com/post/tenstorrent/Software_hu_15a9d1f58a8ec066.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://yuchaosu.com/post/tenstorrent/Software_hu_978631f73d4d279.webp"
width="760"
height="641"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Tenstorrent Software Stack
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>
&lt;strong>TT-MLIR&lt;/strong>: Tenstorrent’s Multi-Level Intermediate Representation (MLIR)-based compiler. It bridges high-level machine learning frameworks with the Tenstorrent software stack.&lt;/li>
&lt;li>
&lt;strong>TT-NN&lt;/strong>: A library of neural network operations that provides a user-friendly interface for running models on Tenstorrent hardware. It is designed to be intuitive for developers familiar with PyTorch.&lt;/li>
&lt;li>
&lt;strong>TT-Buda&lt;/strong>: Deprecated.&lt;/li>
&lt;/ul>
&lt;h2 id="tenstorrent-hardware-overview">Tenstorrent Hardware Overview&lt;/h2>
&lt;p>
&lt;figure id="figure-tenstorrent-hardware-overview">
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="Tenstorrent Hardware Overview"
srcset="https://yuchaosu.com/post/tenstorrent/Overview_hu_2847d3bfd50f4a8.webp 320w, https://yuchaosu.com/post/tenstorrent/Overview_hu_a9f36c0d9b2d1aca.webp 480w, https://yuchaosu.com/post/tenstorrent/Overview_hu_1d1067e0775e6e9c.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://yuchaosu.com/post/tenstorrent/Overview_hu_2847d3bfd50f4a8.webp"
width="760"
height="531"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Tenstorrent Hardware Overview
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;h3 id="overview">Overview&lt;/h3>
&lt;h3 id="tensix-core">Tensix Core&lt;/h3>
&lt;h3 id="large-risc-v-core">Large RISC-V Core&lt;/h3>
&lt;h3 id="arc-core">ARC Core&lt;/h3>
&lt;h3 id="dram-core">DRAM Core&lt;/h3>
&lt;h3 id="ethernet-core">Ethernet Core&lt;/h3>
&lt;h3 id="pcie-core">PCIE Core&lt;/h3>
&lt;h2 id="tensix-core-1">Tensix Core&lt;/h2>
&lt;ul>
&lt;li>Baby RISC-V core&lt;/li>
&lt;li>Router&lt;/li>
&lt;li>Compute&lt;/li>
&lt;li>Buffer&lt;/li>
&lt;/ul></description></item><item><title>Ascend快速上手指南（The Quick Start Guide to Ascend）</title><link>https://yuchaosu.com/post/ascend/</link><pubDate>Tue, 30 Sep 2025 00:00:00 +0000</pubDate><guid>https://yuchaosu.com/post/ascend/</guid><description>&lt;h1 id="pending-construction">PENDING CONSTRUCTION&lt;/h1></description></item><item><title>CUDA快速上手指南（The Quick Start Guide to CUDA）</title><link>https://yuchaosu.com/post/nvidia/</link><pubDate>Tue, 30 Sep 2025 00:00:00 +0000</pubDate><guid>https://yuchaosu.com/post/nvidia/</guid><description>&lt;h2 id="the-architecture-of-nvidia-gpu">The architecture of NVIDIA GPU&lt;/h2>
&lt;p>
&lt;/p>
&lt;p>The GPU was originally designed for graphics rendering. It has a large number of cores that can handle multiple tasks simultaneously, making it ideal for parallel computing. The architecture of an NVIDIA GPU consists of several key components:&lt;/p>
&lt;p>
&lt;figure id="figure-nvidia-gpu-architecture">
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="NVIDIA Streaming Multiprocessors"
srcset="https://yuchaosu.com/post/nvidia/SM_hu_824874689932211f.webp 320w, https://yuchaosu.com/post/nvidia/SM_hu_cdfbda31ae54bbb1.webp 480w, https://yuchaosu.com/post/nvidia/SM_hu_580772807eaf6b46.webp 641w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://yuchaosu.com/post/nvidia/SM_hu_824874689932211f.webp"
width="641"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
NVIDIA GPU Architecture
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Streaming Multiprocessors (SMs)&lt;/strong>: The SM is the core of the GPU. Each SM contains multiple CUDA cores, which are responsible for executing instructions. An SM hosts multiple blocks, and each block contains multiple threads; the SM manages the execution of threads and blocks, and it also handles memory access and scheduling. The shared memory and registers are also part of the SM and are used for fast data access and storage. &lt;strong>The threads of each block share one &lt;span style="color:red">shared memory&lt;/span> allocation, and the SM&amp;rsquo;s &lt;span style="color:red">register file&lt;/span> is partitioned among its resident threads.&lt;/strong> Proper utilization of these resources is crucial for achieving high performance. The number of SMs in a GPU varies depending on the model. For example, the NVIDIA A100 GPU has 108 SMs, while the NVIDIA RTX 3090 has 82 SMs.&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure id="figure-nvidia-cuda-cores">
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="NVIDIA CUDA Cores"
srcset="https://yuchaosu.com/post/nvidia/cudacore_hu_ccc1ed56d4202c45.webp 320w, https://yuchaosu.com/post/nvidia/cudacore_hu_93a4c37775a58bfc.webp 480w, https://yuchaosu.com/post/nvidia/cudacore_hu_fe0f7ca29e39dc11.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://yuchaosu.com/post/nvidia/cudacore_hu_ccc1ed56d4202c45.webp"
width="760"
height="404"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
NVIDIA CUDA Cores
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>&lt;strong>CUDA Cores&lt;/strong>: CUDA cores are the basic processing units of the GPU. They are similar to CPU cores but are optimized for parallel processing. Each SM contains multiple CUDA cores, allowing it to execute many threads simultaneously. CUDA cores used to be called &amp;ldquo;stream processors&amp;rdquo;; the name &amp;ldquo;CUDA core&amp;rdquo; was first introduced with the Fermi architecture.&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure id="figure-nvidia-tensor-cores">
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="NVIDIA Tensor Cores"
srcset="https://yuchaosu.com/post/nvidia/tensorcore_hu_6b551ff90903af09.webp 320w, https://yuchaosu.com/post/nvidia/tensorcore_hu_677a1b4767a52fc.webp 480w, https://yuchaosu.com/post/nvidia/tensorcore_hu_5e61f46929436de.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://yuchaosu.com/post/nvidia/tensorcore_hu_6b551ff90903af09.webp"
width="760"
height="456"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
NVIDIA Tensor Cores
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Tensor Cores&lt;/strong>: Tensor Cores are specialized processing units within NVIDIA GPUs designed to accelerate deep learning and matrix operations. They are optimized for performing mixed-precision matrix multiplications and accumulations, which are common in neural network training and inference. Tensor Cores can significantly speed up computations by handling multiple operations in parallel, making them ideal for AI workloads. (Pending Construction)&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure id="figure-nvidia-gpu-memory-hierarchy">
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="NVIDIA GPU Memory Hierarchy"
srcset="https://yuchaosu.com/post/nvidia/memhier_hu_fd8e20bdd1ed6b95.webp 320w, https://yuchaosu.com/post/nvidia/memhier_hu_5f4c8fdc2e123f81.webp 480w, https://yuchaosu.com/post/nvidia/memhier_hu_53f6e39dbc6d827a.webp 737w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://yuchaosu.com/post/nvidia/memhier_hu_fd8e20bdd1ed6b95.webp"
width="737"
height="664"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
NVIDIA GPU Memory Hierarchy
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Memory Hierarchy&lt;/strong>: The memory hierarchy of NVIDIA GPU includes several types of memory, each with different access speeds and sizes:
&lt;ul>
&lt;li>&lt;strong>Global Memory&lt;/strong>: The largest and slowest memory, accessible by all threads. It is used to store data that needs to be shared among threads.&lt;/li>
&lt;li>&lt;strong>Shared Memory&lt;/strong>: A smaller and faster memory, shared among threads within the same block. It is used for data that needs to be accessed frequently by threads in the same block.&lt;/li>
&lt;li>&lt;strong>Registers&lt;/strong>: The fastest memory, used to store temporary variables for each thread. Each thread has its own set of registers.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Besides these components, NVIDIA GPUs also include other features such as L1 and L2 caches, texture units, and memory controllers to optimize performance and efficiency.&lt;/p>
&lt;h2 id="cuda-programming-model">CUDA Programming Model&lt;/h2>
&lt;p>CUDA is a parallel computing platform and programming model developed by NVIDIA. It allows developers to use NVIDIA GPUs for general-purpose computing. The CUDA programming model consists of several key concepts:&lt;/p>
&lt;p>
&lt;figure id="figure-cuda-program-model">
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="CUDA Program Model"
srcset="https://yuchaosu.com/post/nvidia/programmodel_hu_6c987b246bd2dcfe.webp 320w, https://yuchaosu.com/post/nvidia/programmodel_hu_2c0b73ea179bfd9b.webp 480w, https://yuchaosu.com/post/nvidia/programmodel_hu_96c97828996230.webp 625w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://yuchaosu.com/post/nvidia/programmodel_hu_6c987b246bd2dcfe.webp"
width="625"
height="438"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
CUDA Program Model
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Kernels&lt;/strong>: A kernel is a function that is executed on the GPU. It is defined using the &lt;code>__global__&lt;/code> keyword in CUDA C/C++. When a kernel is launched, it is executed by multiple threads in parallel.&lt;/li>
&lt;li>&lt;strong>Threads&lt;/strong>: A thread is the basic unit of execution in CUDA. Each thread executes a single instance of a kernel. Threads are organized into blocks and grids.&lt;/li>
&lt;li>&lt;strong>Warps&lt;/strong>: A warp is a group of &lt;strong>32 threads&lt;/strong> that are executed simultaneously on a single SM. All threads in a warp execute the same instruction at the same time, which allows for efficient execution of parallel code. Warps are formed by the SM&amp;rsquo;s warp scheduler from the consecutive threads of a block: threads 0-31 form the first warp, threads 32-63 the second, and so on.&lt;/li>
&lt;li>&lt;strong>Blocks&lt;/strong>: A block is a group of threads that can cooperate with each other by sharing data through &lt;strong>shared memory&lt;/strong>. Blocks are executed on a single SM. Each block can contain up to 1024 threads (depending on the GPU architecture). &lt;em>Notably, an SM is not dedicated to a single block; it can execute multiple blocks concurrently, and which SM a block is scheduled on has no effect on thread index calculation.&lt;/em>&lt;/li>
&lt;li>&lt;strong>Grids&lt;/strong>: A grid is a collection of blocks. When a kernel is launched, it is executed by a grid of blocks. The grid can be one-dimensional, two-dimensional, or three-dimensional, allowing for flexible organization of threads.&lt;/li>
&lt;/ul>
&lt;p>From top to bottom, the hierarchy is: Grid -&amp;gt; Blocks -&amp;gt; Threads. A warp is an execution grouping of 32 threads within a block rather than a separate container level, and the kernel is the function that every thread executes.&lt;/p>
&lt;p>&lt;em>The Ampere SM Architecture&lt;/em>
&lt;figure id="figure-ampere-sm-architecture">
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="Ampere SM Architecture"
srcset="https://yuchaosu.com/post/nvidia/ampereSM_hu_5cd9eed4a8134d42.webp 320w, https://yuchaosu.com/post/nvidia/ampereSM_hu_b52e8bcafeb3b462.webp 480w, https://yuchaosu.com/post/nvidia/ampereSM_hu_bcef459442d52939.webp 576w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://yuchaosu.com/post/nvidia/ampereSM_hu_5cd9eed4a8134d42.webp"
width="576"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Ampere SM Architecture
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>Roughly speaking, each CUDA core can be thought of as a small CPU-like arithmetic core: an SM contains many CUDA cores and executes one or more blocks, each block contains many threads, and each grid contains multiple blocks. The kernel is the function that runs on each thread.&lt;/p>
&lt;h3 id="cuda-memory-management">CUDA Memory Management&lt;/h3>
&lt;p>CUDA provides several APIs for managing memory on the GPU. Developers can allocate and free memory on the GPU using functions like &lt;code>cudaMalloc()&lt;/code> and &lt;code>cudaFree()&lt;/code>. Data can be transferred between the host (CPU) and device (GPU) using functions like &lt;code>cudaMemcpy()&lt;/code>. CUDA also supports unified memory, which allows the CPU and GPU to share a single memory space.&lt;/p>
&lt;p>GPU global memory is a linear address space: multi-dimensional data such as a matrix is typically laid out row by row (row-major order), and each element is accessed by its linear index. The key to efficient algorithms on CUDA is &lt;strong>coalesced memory access&lt;/strong>: threads in a warp should access consecutive memory locations to maximize memory bandwidth. For example, if thread 0 accesses element 0, thread 1 should access element 1, and so on. This allows the GPU to serve all threads in a warp with a single memory transaction, improving performance.&lt;/p>
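&lt;p>To make this concrete, here is a toy Python sketch (this post contains no code of its own, so the helper name and the 128-byte transaction size are illustrative assumptions): it counts how many 128-byte memory segments a 32-thread warp touches for coalesced versus strided access.&lt;/p>

```python
# Toy model: how many 128-byte memory segments does a 32-thread warp touch?
WARP_SIZE = 32
ELEM_BYTES = 4       # e.g. a float32 element
SEGMENT_BYTES = 128  # assumed memory-transaction granularity

def segments_touched(stride):
    """Number of 128-byte segments hit when lane i reads element i*stride."""
    addresses = [lane * stride * ELEM_BYTES for lane in range(WARP_SIZE)]
    return len({addr // SEGMENT_BYTES for addr in addresses})

print(segments_touched(stride=1))   # coalesced access: 1 transaction
print(segments_touched(stride=32))  # strided access: 32 transactions
```

&lt;p>With stride 1 the whole warp is served by a single segment; with stride 32 every lane lands in its own segment, so fetching the same amount of useful data costs 32 transactions.&lt;/p>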
&lt;p>&lt;strong>CUDA is the game to play with thread indexing and memory indexing.&lt;/strong>&lt;/p>
&lt;h3 id="cuda-execution-configuration">CUDA Execution Configuration&lt;/h3>
&lt;p>When launching a kernel, developers need to specify the execution configuration, which includes the number of blocks and threads per block. This configuration determines how many threads will be executed in parallel on the GPU. The execution configuration is specified using the &lt;code>&amp;lt;&amp;lt;&amp;lt;...&amp;gt;&amp;gt;&amp;gt;&lt;/code> syntax when calling a kernel.&lt;/p>
&lt;p>Kernel execution in CUDA is asynchronous by default: when a kernel is launched, the CPU continues executing the next instructions without waiting for the kernel to finish. To synchronize the CPU and GPU, developers can use functions like &lt;code>cudaDeviceSynchronize()&lt;/code>.&lt;/p>
&lt;p>Thread indexing in CUDA uses built-in variables like &lt;code>threadIdx&lt;/code>, &lt;code>blockIdx&lt;/code>, and &lt;code>blockDim&lt;/code>. These variables give every thread a unique index within its block and grid, which the thread uses to select the specific data elements it should process.&lt;/p>
&lt;p>
&lt;figure id="figure-cuda-thread-indexing">
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="CUDA Thread Indexing"
srcset="https://yuchaosu.com/post/nvidia/threadidx_hu_61f83d0bf702fcc6.webp 320w, https://yuchaosu.com/post/nvidia/threadidx_hu_7bc7fd510ec7e240.webp 480w, https://yuchaosu.com/post/nvidia/threadidx_hu_2aeeabdcff403296.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://yuchaosu.com/post/nvidia/threadidx_hu_61f83d0bf702fcc6.webp"
width="760"
height="445"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
CUDA Thread Indexing
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>&lt;code>threadIdx&lt;/code>, &lt;code>blockIdx&lt;/code>, and &lt;code>blockDim&lt;/code> are &lt;code>dim3&lt;/code> variables, so they can be used in 1D, 2D, or 3D configurations, allowing flexible organization of threads and blocks. Notably, higher-dimensional indexing is only a logical organization; the underlying physical layout is still linear.&lt;/p>
&lt;p>To calculate the global index of a thread in a 2D grid, you can use the following formula:
&lt;/p>
$$
\text{global}_x = \text{blockIdx}.x \times \text{blockDim}.x + \text{threadIdx}.x \\
\text{global}_y = \text{blockIdx}.y \times \text{blockDim}.y + \text{threadIdx}.y \\
\text{idx} = \text{global}_y \times \text{width} + \text{global}_x \\
\text{idx} = (\text{blockIdx}.y \times \text{blockDim}.y + \text{threadIdx}.y) \times (\text{blockDim}.x \times \text{gridDim}.x) \\ + (\text{blockIdx}.x \times \text{blockDim}.x + \text{threadIdx}.x)
$$&lt;p>Where &lt;code>width&lt;/code> is the total number of threads in the x direction of the grid, i.e. \( \text{blockDim}.x \times \text{gridDim}.x \). This formula calculates the global x and y coordinates of the thread from its block and thread indices, and then computes a linear index &lt;code>idx&lt;/code> that can be used to access elements in a 1D array representation of the 2D data.&lt;/p>
&lt;p>For example, as shown in the figure, consider a 2D grid with 5 blocks in the x direction and 4 blocks in the y direction, where each block has 5 threads in the x direction and 5 threads in the y direction. To calculate the index of thread 12 (at position (2, 2) within its block) in block 12 (at position (2, 2) within the grid), apply the formula above with width = 5 × 5 = 25:
&lt;/p>
$$
\text{global}_x = 2 \times 5 + 2 = 12 \\
\text{global}_y = 2 \times 5 + 2 = 12 \\
\text{idx} = 12 \times 25 + 12 = 312 \\
$$
$$&lt;p>When the kernel is launched, the GPU will execute the kernel on multiple threads in parallel. Each thread will have its own unique index, which can be used to access specific data elements in memory.&lt;/p>
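&lt;p>The index arithmetic above is plain integer math and can be checked with a short Python sketch (the helper name is illustrative; no CUDA is required):&lt;/p>

```python
def global_index(block_idx, thread_idx, block_dim, grid_dim):
    """Linear index of a thread in a 2D grid, row-major layout.

    block_idx, thread_idx, block_dim, grid_dim are (x, y) pairs.
    """
    gx = block_idx[0] * block_dim[0] + thread_idx[0]
    gy = block_idx[1] * block_dim[1] + thread_idx[1]
    width = block_dim[0] * grid_dim[0]  # total threads in the x direction
    return gy * width + gx

# The figure's example: a 5x4 grid of 5x5 blocks; block 12 sits at (x=2, y=2)
# in the grid, and thread 12 sits at (x=2, y=2) within the block.
idx = global_index(block_idx=(2, 2), thread_idx=(2, 2),
                   block_dim=(5, 5), grid_dim=(5, 4))
print(idx)  # 12 * 25 + 12 = 312
```

&lt;p>Note that the row width is &lt;code>blockDim.x * gridDim.x&lt;/code> = 25 here, the total number of threads per row of the grid.&lt;/p>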
&lt;p>Besides, &lt;strong>the warp size is 32&lt;/strong>, which means that 32 threads are executed simultaneously on a single SM. Therefore, it is recommended to use a multiple of 32 for the number of threads per block to ensure optimal performance.&lt;/p>
&lt;ul>
&lt;li>If the number of threads per block is not a multiple of 32, some threads in the last warp may be &lt;strong>idle&lt;/strong>, leading to suboptimal performance.&lt;/li>
&lt;li>If the number of threads is larger than 32, multiple warps will be created, and the GPU will &lt;strong>schedule the warps for execution.&lt;/strong>&lt;/li>
&lt;li>The maximum number of threads per block is 1024 for most NVIDIA GPUs. However, the optimal number of threads per block may vary depending on the &lt;strong>specific GPU architecture&lt;/strong> and the nature of the &lt;strong>kernel&lt;/strong> being executed. It is recommended to experiment with different configurations to find the optimal settings for a specific application. When considering the number of threads per block, it is &lt;strong>also&lt;/strong> &lt;strong>important&lt;/strong> to take into account the amount of shared memory and registers used by each thread, as these resources are limited on the GPU.&lt;/li>
&lt;/ul>
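&lt;p>A common pattern is to fix a multiple-of-32 block size and derive the number of blocks by ceiling division so that all elements are covered. A small Python sketch of that arithmetic (the helper name is illustrative, not a CUDA API):&lt;/p>

```python
def launch_config(n_elements, threads_per_block=256):
    """Grid size via ceiling division, covering every element at least once."""
    assert threads_per_block % 32 == 0, "use a multiple of the warp size"
    blocks = (n_elements + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block

print(launch_config(1000))  # (4, 256): 4 blocks cover 1024 slots, 24 unused
print(launch_config(1024))  # (4, 256): exact fit
```

&lt;p>The returned pair corresponds to the two values placed inside the &lt;code>&amp;lt;&amp;lt;&amp;lt;...&amp;gt;&amp;gt;&amp;gt;&lt;/code> launch syntax; the leftover threads in the last block simply exit early after a bounds check inside the kernel.&lt;/p>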
&lt;h2 id="cuda-divergence">CUDA Divergence&lt;/h2>
&lt;p>Divergence occurs when threads within a warp take different execution paths due to conditional statements (e.g., if-else statements). Since all threads in a warp execute the same instruction at the same time, divergence can lead to performance degradation. When threads diverge, the warp must execute each path sequentially, which can result in some threads being idle while others are executing.&lt;/p>
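&lt;p>A toy cost model makes the penalty visible (Python; the branch costs are made-up numbers, not measurements): a warp pays for every branch that at least one of its lanes takes, so a diverged warp pays the sum of the branch costs rather than the cost of one branch.&lt;/p>

```python
def warp_time(branch_taken, cost_if=10, cost_else=10):
    """Toy cost model: a warp executes every path any of its lanes takes."""
    time = 0
    if any(branch_taken):      # at least one lane takes the if-branch
        time += cost_if
    if not all(branch_taken):  # at least one lane takes the else-branch
        time += cost_else
    return time

uniform = [True] * 32                        # all lanes agree: no divergence
diverged = [i % 2 == 0 for i in range(32)]   # lanes alternate: divergence

print(warp_time(uniform))   # 10: only one path is executed
print(warp_time(diverged))  # 20: both paths are executed serially
```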
&lt;h2 id="compute-kernel-design">Compute Kernel Design&lt;/h2>
&lt;p>When designing compute kernels for CUDA, it is important to consider the following best practices:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Maximize Parallelism&lt;/strong>: Design kernels to maximize the number of threads that can be executed in parallel. This can be achieved by breaking down the problem into smaller tasks that can be executed independently.&lt;/li>
&lt;li>&lt;strong>Optimize Memory Access&lt;/strong>: Ensure that threads access memory in a coalesced manner to maximize memory bandwidth. Use shared memory to store frequently accessed data and minimize global memory accesses.&lt;/li>
&lt;li>&lt;strong>Minimize Divergence&lt;/strong>: Avoid conditional statements that can lead to divergence within a warp. If divergence is unavoidable, try to structure the code to minimize its impact.&lt;/li>
&lt;li>&lt;strong>Use Appropriate Execution Configuration&lt;/strong>: Choose the number of blocks and threads per block based on the specific GPU architecture and the nature of the kernel being executed. Experiment with different configurations to find the optimal settings for a specific application.&lt;/li>
&lt;li>&lt;strong>Profile and Optimize&lt;/strong>: Use profiling tools to analyze the performance of the kernel and identify bottlenecks. Optimize the kernel based on the profiling results to improve performance.&lt;/li>
&lt;/ul>
&lt;h2 id="tensor-cores">Tensor Cores&lt;/h2>
&lt;p>Tensor Cores are specialized processing units within NVIDIA GPUs designed to accelerate deep learning and matrix operations. They are optimized for performing mixed-precision matrix multiplications and accumulations, which are common in neural network training and inference. Tensor Cores can significantly speed up computations by handling multiple operations in parallel, making them ideal for AI workloads.&lt;/p>
&lt;h2 id="cuda-libraries">CUDA Libraries&lt;/h2>
&lt;p>NVIDIA provides several libraries that can be used to simplify CUDA programming and improve performance. Some of the most commonly used libraries include:&lt;/p>
&lt;h3 id="cutlass">CUTLASS&lt;/h3>
&lt;p>CUTLASS (CUDA Templates for Linear Algebra Subroutines and Solvers) is a collection of CUDA C++ template abstractions for implementing high-performance matrix-multiplication (GEMM) at all levels and scales within CUDA. CUTLASS provides a flexible and efficient way to perform matrix operations on NVIDIA GPUs, making it easier for developers to implement custom algorithms that leverage the power of CUDA.&lt;/p>
&lt;h3 id="cublas">cuBLAS&lt;/h3>
&lt;p>cuBLAS is a GPU-accelerated library for dense linear algebra that provides a set of basic linear algebra subroutines (BLAS) for CUDA. It is designed to deliver high performance for matrix-matrix and matrix-vector operations on NVIDIA GPUs, making it a popular choice for deep learning and scientific computing applications.&lt;/p></description></item><item><title>矢量（Vector）</title><link>https://yuchaosu.com/post/vector/</link><pubDate>Sat, 27 Sep 2025 00:00:00 +0000</pubDate><guid>https://yuchaosu.com/post/vector/</guid><description>&lt;p>&lt;em>以下文字叙述仅适用于高中及以下阶段，不适用于高等教育。&lt;/em>&lt;/p>
&lt;p>&lt;em>The following applies only at the K-12 level, not to higher education.&lt;/em>&lt;/p>
&lt;p>最近在使用梯度的时候，我发现自己其实并不真正理解矢量的概念。因此，我想写下自己对如何更好理解矢量的思考。&lt;/p>
&lt;p>当我第一次在数学课上学习矢量时，老师告诉我，标量只有大小，而矢量既有大小又有方向。但我一直没有真正理解这一点，我更多关注的是方向部分，因为我被告知这是它与标量最大的区别。但在数学中，方向并不总是很明显。例如，在二维空间中，一个矢量可以表示为 (x, y)。那么这个矢量的方向是什么？是与 x 轴的夹角吗？还是与 y 轴的夹角？还是其他什么？当维度高于三维时，方向就更加令人困惑了。&lt;/p>
&lt;p>此外，(x, y) 在二维空间中也是一个点。那么点和矢量有什么区别呢？老师说，点是空间中的一个位置，而矢量是方向和大小。但我还是不太明白。因为如果我们有两个点 (x1, y1) 和 (x2, y2)，我们可以通过相减得到一个矢量：(x2 - x1, y2 - y1)。所以在这种情况下，矢量也可以看作是空间中的一个位置。但两个点之间的运算是没有意义的，我们不能直接相加或相减两个点，只能对矢量进行加减运算。&lt;/p>
&lt;p>几乎在同一时间，我在物理课上也学习了矢量。老师说速度是一个矢量，因为它既有大小又有方向。但在解题时，我们通常只关心大小。例如，如果一辆车以 60 公里每小时的速度行驶，我们通常不关心方向，只关心它有多快。所以在这种情况下，速度可以被当作标量。但如果我们想知道车往哪个方向开，就需要知道方向，这时速度就是矢量。（速度在英文的定义中更为精准，英文中存在speed和velocity两个单词来描述速度，speed代表标量velocity代表矢量，但在中文中通常只用一个词来表示。）&lt;/p>
&lt;p>这两种经历让我当时更加困惑。一个是在数学上无法理解，另一个是在物理上用其他方式表示（物理中的方向通常用文字而不是数学表达）。这让我思考，为什么它们都叫矢量，但在不同的语境下却有不同的处理方式。&lt;/p>
&lt;p>经过长时间的思考，我对矢量有了新的理解：&lt;/p>
&lt;ul>
&lt;li>从定义上讲，标量是单一的数值，而矢量是一组数值。&lt;/li>
&lt;li>在数学中，标量是单一数值，矢量是一组数值。方向存在，但意义不明显（缺乏语境）。&lt;/li>
&lt;li>在应用科学中，矢量包含比标量更多的信息，但我们通常用其他方式来表示这些信息。例如，在物理中，速度是矢量，因为它有大小和方向。但在解题时，我们通常只计算大小，用文字描述方向（高中阶段）。&lt;/li>
&lt;/ul>
&lt;p>既然数学无处不在，我认为应用科学的老师应该更多地用数学的方法来解释矢量，并告诉学生为什么我们只计算大小，以及在本领域内矢量的其余维度是如何用其他方式表示的。这样可以统一数学和应用科学中对矢量的定义，帮助学生更好地理解矢量的概念。&lt;/p>
&lt;p>Recently, while working with gradients, I realized that I didn&amp;rsquo;t really understand vectors. So I want to write down my thoughts on how to better understand the concept of a vector.&lt;/p>
&lt;p>When I first learned about vectors in math class, the teacher told me that the difference between a scalar and a vector is that a scalar has only magnitude, while a vector has both magnitude and direction. But I never really understood that. I paid more attention to the direction part, because I was told that this is the biggest difference from a scalar. In math, though, the direction is not always obvious. For example, in 2D space, a vector can be represented as (x, y). What is the direction of this vector? Is it the angle between the vector and the x-axis? The angle with the y-axis? Something else? When the dimension is higher than 3, the direction is even more confusing.&lt;/p>
&lt;p>Besides, (x, y) is also a point in 2D space. So what is the difference between a point and a vector? The teacher said that a point is a location in space, while a vector is a direction and a magnitude. But I still didn&amp;rsquo;t understand, because if we have two points (x1, y1) and (x2, y2), we can create a vector by subtracting them: (x2 - x1, y2 - y1). So in this sense a vector can also look like a location in space. Yet adding two points is meaningless: subtracting two points yields a vector, and only vectors support addition and subtraction in general.&lt;/p>
&lt;p>Around the same time, I learned about vectors in physics class. The teacher said that velocity is a vector because it has both magnitude and direction. But when we solve problems, we usually only care about the magnitude. For example, if a car is moving at 60 kph, we usually don&amp;rsquo;t care about the direction, just about how fast it is moving. In that case, velocity is treated as a scalar. But if we want to know where the car is going, we need the direction, and then velocity is a vector. (English distinguishes &lt;em>speed&lt;/em>, a scalar, from &lt;em>velocity&lt;/em>, a vector; Chinese typically uses a single word for both.)&lt;/p>
&lt;p>These two experiences made me more confused at the time: one definition I could not understand (in math), and the other was expressed in a different way (direction in physics is usually described in words rather than in math). This made me wonder why both are called vectors yet get different treatment in different contexts.&lt;/p>
&lt;p>Today, after thinking about it for a long time, I have some new understanding of vectors.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>By definition, a scalar is a single value, while a vector is a collection of values.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>In math, a scalar is a single value, while a vector is a collection of values. The direction exists, but its meaning is not obvious (it lacks context).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>In applied science, a vector carries more information than a scalar, but we often represent that extra information in other ways. For example, in physics, velocity is a vector because it has both magnitude and direction, yet when solving problems we usually compute only the magnitude and describe the direction in words (at the high-school level).&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Since math is everywhere, I think applied-science teachers should explain vectors more mathematically, and tell students why we compute only the magnitude and how the remaining dimensions of a vector are represented in other ways within their field. This would align the definition of a vector across math and applied science, and help students understand the concept better.&lt;/p></description></item><item><title>机器学习数学基础（The Mathematical Foundations of Machine Learning）</title><link>https://yuchaosu.com/post/machinelearningmath/</link><pubDate>Wed, 24 Sep 2025 00:00:00 +0000</pubDate><guid>https://yuchaosu.com/post/machinelearningmath/</guid><description>&lt;h2 id="mathematical-foundations">Mathematical Foundations&lt;/h2>
&lt;h3 id="gradient">Gradient&lt;/h3>
&lt;p>Assume a function \( f(x_1, x_2, \ldots, x_n) \) with multiple variables. The gradient of this function is a vector that contains all its partial derivatives:
&lt;/p>
$$ \nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right) $$&lt;p>
&lt;strong>Process:&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>Compute the exact differential of the function \( f \):
$$ df = \frac{\partial f}{\partial x_1} dx_1 + \frac{\partial f}{\partial x_2} dx_2 + \ldots + \frac{\partial f}{\partial x_n} dx_n $$&lt;/li>
&lt;li>Create two vectors:
&lt;ul>
&lt;li>The vector of partial derivatives:
$$\boldsymbol{\alpha} = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right) $$&lt;/li>
&lt;li>The vector of differentials:
$$ \mathbf{dx} = (dx_1, dx_2, \ldots, dx_n) $$&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>The exact differential can be expressed as the dot product of these two vectors:
$$ df = \boldsymbol{\alpha} \cdot \mathbf{dx} $$&lt;/li>
&lt;li>When we compute the vector dot product, we get:
$$ \boldsymbol{\alpha} \cdot \mathbf{dx} = |\boldsymbol{\alpha}| |\mathbf{dx}| \cos(\theta) $$
where \( \theta \) is the angle between the two vectors.&lt;/li>
&lt;li>The value of $|\mathbf{dx}|$ is fixed, representing a small change in the input variables. The value of \( \cos(\theta) \) varies between -1 and 1, depending on the angle between the two vectors. The maximum value of \( df \) occurs when \( \theta = 0 \), meaning the two vectors are aligned. In this case:
$$ df_{max} = |\boldsymbol{\alpha}| |\mathbf{dx}| $$
In other words, the maximum rate of change of the function \( f \) occurs in the direction of the gradient vector \( \boldsymbol{\alpha} \), which matches the original definition.
The gradient collects all the partial derivatives into one vector and points in the direction of steepest ascent of the function (the direction in which it changes fastest).
Follow the &lt;strong>reversed&lt;/strong> gradient direction a little, recompute the &lt;strong>reversed&lt;/strong> gradient, follow the new direction, and repeat; this leads to a &lt;strong>local minimum&lt;/strong> of the function, which is how a loss function is minimized.&lt;/li>
&lt;/ol>
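&lt;p>Step 5 (follow the reversed gradient and repeat) can be sketched numerically in Python; the quadratic function, step size, and step count below are illustrative assumptions:&lt;/p>

```python
def grad_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step against the gradient to approach a local minimum."""
    x = list(x0)
    for _ in range(steps):
        x = [xi - lr * gi for xi, gi in zip(x, grad(x))]
    return x

# Hypothetical loss f(x, y) = x^2 + y^2, whose gradient is (2x, 2y);
# its unique minimum is at (0, 0).
grad_f = lambda v: [2 * v[0], 2 * v[1]]
x_min = grad_descent(grad_f, [3.0, -4.0])
print(x_min)  # both coordinates end up very close to 0
```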
&lt;h3 id="norm">Norm&lt;/h3>
&lt;p>The norm of a vector is a measure of its length or magnitude.&lt;/p>
&lt;ul>
&lt;li>L0 norm: The L0 norm of a vector counts the number of non-zero elements in the vector. It is defined as:
$$ ||\mathbf{x}||_0 = \text{number of non-zero elements in } \mathbf{x} $$&lt;/li>
&lt;li>L1 norm: The L1 norm of a vector is the sum of the absolute values of its elements. Also known as Manhattan distance, it is defined as:
$$ ||\mathbf{x}||_1 = \sum_{i=1}^{n} |x_i| $$&lt;/li>
&lt;li>L2 norm: The L2 norm of a vector is the square root of the sum of the squares of its elements. Also known as the Euclidean norm, it is defined as:
$$ ||\mathbf{x}||_2 = \sqrt{\sum_{i=1}^{n} x_i^2} $$&lt;/li>
&lt;li>L-p norm: The L-p norm of a vector is a generalization of the L1 and L2 norms. It is defined as:
$$ ||\mathbf{x}||_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p} $$
for any real number \( p \geq 1 \).&lt;/li>
&lt;li>L-infinity norm: The L-infinity norm of a vector is the maximum absolute value among its elements. It is defined as:
$$ ||\mathbf{x}||_\infty = \max_{1 \leq i \leq n} |x_i| $$&lt;/li>
&lt;/ul>
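&lt;p>The definitions above can be checked numerically with a few lines of Python (plain arithmetic, no libraries; the helper name is illustrative):&lt;/p>

```python
def norms(x):
    """L0, L1, L2 and L-infinity norms of a vector, per the definitions above."""
    l0 = sum(1 for xi in x if xi != 0)        # count of non-zero elements
    l1 = sum(abs(xi) for xi in x)             # sum of absolute values
    l2 = sum(xi ** 2 for xi in x) ** 0.5      # Euclidean length
    linf = max(abs(xi) for xi in x)           # largest absolute value
    return l0, l1, l2, linf

print(norms([3, 0, -4]))  # (2, 7, 5.0, 4)
```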
&lt;h3 id="probability">Probability&lt;/h3>
&lt;h4 id="independent-events">Independent events&lt;/h4>
$$ P(A \cap B) = P(A)P(B) $$&lt;p>
&lt;/p>
$$ P(A \cup B) = P(A) + P(B) - P(A)P(B) $$&lt;h4 id="conditional-probability">Conditional Probability&lt;/h4>
$$ P(A \cap B) = P(A|B)P(B) = P(B|A)P(A) $$&lt;p>
&lt;/p>
$$ P(A \cup B) = P(A) + P(B) - P(A \cap B) $$&lt;p>Check if two events are independent:
&lt;/p>
$$ P(A|B) = P(A) $$&lt;p> or
&lt;/p>
$$ P(B|A) = P(B) $$&lt;h4 id="bayes-theorem">Bayes&amp;rsquo; Theorem&lt;/h4>
&lt;ul>
&lt;li>
&lt;p>Prior probability vs Posterior probability:
Prior probability is the initial probability of an event before new evidence is considered. Posterior probability is the updated probability of an event after considering new evidence.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Full probability formula:
&lt;/p>
$$\begin{align*}
P(B) &amp;= P(A_1)P(B|A_1) + P(A_2)P(B|A_2) + \ldots \\
&amp;= \sum_{i} P(B|A_i)P(A_i)
\end{align*}$$&lt;/li>
&lt;li>
&lt;p>Bayes&amp;rsquo; Theorem:
&lt;/p>
$$ P(A_i|B) = \frac{P(B|A_i)P(A_i)}{P(B)} = \frac{P(B|A_i)P(A_i)}{\sum_{j} P(B|A_j)P(A_j)} $$&lt;p>
The usage of Bayes&amp;rsquo; theorem is to update the probability of an event based on new evidence. We can use the posterior probability to predict the likelihood of an event occurring given the new evidence.
Given the machine learning context, we can use the dataset as &amp;ldquo;B&amp;rdquo;, and the model parameters as &amp;ldquo;A&amp;rdquo;. Then we can use Bayes&amp;rsquo; theorem to update the model parameters based on the dataset.&lt;/p>
&lt;/li>
&lt;/ul>
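&lt;p>A quick numerical check of Bayes&amp;rsquo; theorem together with the full probability formula (Python; the 1% prior, 99% sensitivity, and 5% false-positive rate are made-up numbers for illustration):&lt;/p>

```python
def bayes_posterior(prior, likelihood, likelihood_complement):
    """P(A|B) = P(B|A)P(A) / P(B), with P(B) from the full probability formula."""
    evidence = likelihood * prior + likelihood_complement * (1 - prior)
    return likelihood * prior / evidence

# Made-up numbers: P(A) = 0.01 (prior), P(B|A) = 0.99, P(B|not A) = 0.05.
posterior = bayes_posterior(prior=0.01, likelihood=0.99,
                            likelihood_complement=0.05)
print(round(posterior, 4))  # 0.1667
```

&lt;p>Even with a 99% likelihood under A, the evidence only raises the probability to about 17% because the prior is so small: exactly the prior-to-posterior update described above.&lt;/p>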
&lt;h4 id="likelihood-function">Likelihood function&lt;/h4>
&lt;p>The likelihood function is a fundamental concept in statistics and machine learning. It is used to estimate the parameters of a statistical model given observed data. The likelihood function measures how plausible different parameter values are given the observed data, whereas a probability function measures how likely different outcomes are given fixed parameters: the same formula, but with a different variable held fixed.&lt;/p>
&lt;p>In more common terms, the likelihood function is a function of the parameters of a statistical model with the data held fixed: its input is a candidate set of parameters, and its output is how probable the observed data are under those parameters.&lt;/p>
&lt;p>For example, suppose a model with two parameters x and y produced the observed output 5. The likelihood function describes how probable observing 5 is for different values of x and y.&lt;/p>
&lt;p>Assume a statistical model with parameters \( \theta \) and observed data \( D \). The likelihood function \( L(\theta | D) \) is defined as the probability of observing the data given the parameters:
&lt;/p>
$$ L(\theta | D) = P(D | \theta) $$&lt;p>In machine learning, the likelihood function is often used in maximum likelihood estimation (MLE), which finds the parameter values that make the observed data most probable.
An independent and identically distributed (i.i.d.) assumption is often made: the data points are assumed to be independent of each other and drawn from the same distribution. The distribution here can be thought of as the machine learning model, and the likelihood function is used to find the model parameters that best fit the data.
Assume i.i.d. observations \( D = \{x_1, x_2, \ldots, x_n\} \). The likelihood function can then be expressed as:
&lt;/p>
$$ L(\theta | D) = \prod_{i=1}^{n} P(x_i | \theta) $$&lt;p>
Since they are independent, the joint probability is the product of individual probabilities.
To simplify the computation, we often work with the log-likelihood function, which is the natural logarithm of the likelihood function:
&lt;/p>
$$ \ell(\theta | D) = \log L(\theta | D) = \sum_{i=1}^{n} \log P(x_i | \theta) $$&lt;p>
Maximizing the log-likelihood is equivalent to maximizing the likelihood due to the monotonic nature of the logarithm. It is often easier to work with the log-likelihood due to its additive properties.&lt;/p>
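&lt;p>For a concrete (hypothetical) case, take i.i.d. Bernoulli coin flips: evaluating the log-likelihood over a grid of parameter values shows the maximum at the sample frequency of heads, which is exactly the MLE for this model.&lt;/p>

```python
import math

def log_likelihood(p, flips):
    """Sum of log P(x_i | p) for i.i.d. Bernoulli observations (0 or 1)."""
    return sum(math.log(p if x == 1 else 1 - p) for x in flips)

flips = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]  # 7 heads out of 10
candidates = [i / 100 for i in range(1, 100)]  # avoid log(0) at p = 0 or 1
best = max(candidates, key=lambda p: log_likelihood(p, flips))
print(best)  # 0.7, exactly the sample frequency of heads
```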
&lt;h4 id="maximum-likelihood-estimation-mle">Maximum Likelihood Estimation (MLE)&lt;/h4>
&lt;p>Maximum Likelihood Estimation (MLE) is a method for estimating the parameters of a statistical model by maximizing the likelihood function. The goal of MLE is to find the parameter values that make the observed data most probable.
In machine learning, we only have the dataset, and we want to find the model parameters that best fit it. MLE achieves this by finding the parameter values that maximize the likelihood of the observed data. In other words, we want the model parameters under which the observed dataset is the most likely outcome.&lt;/p></description></item></channel></rss>