Posts

Machine Learning

Brief Introduction of Label Propagation Algorithm

As I said before, I’m working on a text classification project. I use doc2vec to convert text into vectors, then I use LPA to classify the vectors.

LPA is a simple, effective semi-supervised algorithm. It can use the density of unlabeled data to find a hyperplane to split the data.

Here are the main steps of the algorithm:

Let $(x_1,y_1)…(x_l,y_l)$ be the labeled data, where $Y_L = \{y_1…y_l\}$ are the class labels. Let $(x_{l+1},y_{l+1})…(x_{l+u},y_{l+u})$ be the unlabeled data, where $Y_U = \{y_{l+1}…y_{l+u}\}$ are unobserved; usually $l \ll u$. Let $X=\{x_1…x_{l+u}\}$ where $x_i∈ R^D$. The problem is to estimate $Y_U$ from $X$ and $Y_L$.
Calculate the similarity of the data points. The simplest metric is Euclidean distance. Use a parameter $σ$ to control the weights.

$$w_{ij}= \exp\left(-\frac{d^2_{ij}}{σ^2}\right)=\exp\left(-\frac{∑^D_{d=1}(x^d_i-x^d_j)^2}{σ^2}\right)$$

Larger weights allow labels to travel more easily.

Define a $(l+u)*(l+u)$ probabilistic transition matrix $T$

$$T_{ij}=P(j → i)=\frac{w_{ij}}{∑^{l+u}_{k=1}w_{kj}}$$

$T_{ij}$ is the probability of jumping from node $j$ to node $i$. If there are $C$ classes, we can define a $(l+u)*C$ label matrix $Y$ whose $i$-th row holds the probability of data point $i$ belonging to each class $c$. The initialization of the rows for unlabeled data points is not important.

Propagate $Y ← TY$
Row-normalize $Y$.
Reset the rows of $Y$ that belong to labeled data to their known labels. Repeat from the propagation step until $Y$ converges.

In short: give nearer points larger weights, recompute each point's label distribution from its neighbours, reset the labeled points, and repeat.
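The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code used in the project: the function name, the uniform initialization of the unlabeled rows, the convergence tolerance and the toy 1-D data are all assumptions made for the example.

```python
import numpy as np

def label_propagation(X, y_labeled, n_classes, sigma=1.0, max_iter=1000, tol=1e-6):
    """X: (l+u, D) array whose first l rows are labeled; y_labeled: l class ids."""
    n, l = len(X), len(y_labeled)
    # Pairwise squared Euclidean distances and Gaussian weights w_ij.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / sigma ** 2)
    # T_ij = w_ij / sum_k w_kj, so every column of T sums to 1.
    T = W / W.sum(axis=0, keepdims=True)
    # Label matrix: one-hot rows for labeled points, uniform rows otherwise.
    Y = np.full((n, n_classes), 1.0 / n_classes)
    clamp = np.eye(n_classes)[y_labeled]
    Y[:l] = clamp
    for _ in range(max_iter):
        Y_new = T @ Y                                # propagate
        Y_new /= Y_new.sum(axis=1, keepdims=True)    # row-normalize
        Y_new[:l] = clamp                            # reset the labeled rows
        converged = np.abs(Y_new - Y).max() < tol
        Y = Y_new
        if converged:
            break
    return Y[l:].argmax(axis=1)

# Two labeled points (one per class) followed by four unlabeled points.
X = np.array([[0.0], [10.0], [0.5], [1.0], [9.0], [9.5]])
pred = label_propagation(X, np.array([0, 1]), n_classes=2, sigma=2.0)
print(pred)
```

The unlabeled points near 0 pick up class 0 and the ones near 10 pick up class 1, because the weights between the two clusters are close to zero.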


LSTM and GRU

LSTM

To avoid the vanishing and exploding gradient problems of the vanilla RNN, LSTM was proposed; it can remember information for longer periods of time.

Here is the structure of LSTM:

The calculation proceeds as follows:

$$\begin{aligned} f_t&=σ(W_f⋅[h_{t-1},x_t]+b_f)\\ i_t&=σ(W_i⋅[h_{t-1},x_t]+b_i)\\ o_t&=σ(W_o⋅[h_{t-1},x_t]+b_o)\\ \tilde{C}_t&=\tanh(W_C⋅[h_{t-1},x_t]+b_C)\\ C_t&=f_t∗ C_{t-1}+i_t∗ \tilde{C}_t\\ h_t&=o_t ∗ \tanh(C_t) \end{aligned}$$

$f_t$, $i_t$ and $o_t$ are the forget gate, input gate and output gate respectively. $\tilde{C}_t$ is the new memory content, $C_t$ is the cell state, and $h_t$ is the output.

$f_t$ and $i_t$ are used to update $C_t$, and $o_t$ decides which part of the cell state is output as the hidden state.
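To make the equations concrete, here is a small NumPy sketch of a single LSTM step. The dictionary layout of the weights, the sizes (hidden size 4, input size 3) and the random initialization are assumptions for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W[k]: (hidden, hidden + input), b[k]: (hidden,)."""
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])         # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])         # input gate
    o_t = sigmoid(W["o"] @ z + b["o"])         # output gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])     # new memory content
    c_t = f_t * c_prev + i_t * c_tilde         # element-wise cell state update
    h_t = o_t * np.tanh(c_t)                   # output
    return h_t, c_t

rng = np.random.default_rng(0)
hidden, inp = 4, 3
W = {k: rng.normal(scale=0.1, size=(hidden, hidden + inp)) for k in "fioc"}
b = {k: np.zeros(hidden) for k in "fioc"}
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, inp)):          # a sequence of 5 inputs
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)
```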

GRU

$$\begin{aligned} z_t&=σ(W_z⋅[h_{t-1},x_t])\\ r_t&=σ(W_r⋅[h_{t-1},x_t])\\ \tilde{h}_t&=\tanh(W⋅[r_t ∗ h_{t-1},x_t])\\ h_t&=(1-z_t)∗ h_{t-1}+z_t ∗ \tilde{h}_t \end{aligned}$$

$z_t$ is the update gate, $r_t$ is the reset gate, $\tilde{h}_t$ is the candidate activation, and $h_t$ is the activation.

Compared with LSTM, GRU merges the cell state and the hidden state into one hidden state, and uses $z_t$ alone to decide how to update the state rather than $f_t$ and $i_t$.
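The GRU equations translate to code just as directly. The sketch below is illustrative; the argument names and the all-zero weights in the usage example are assumptions chosen so that the result is easy to check by hand.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step. Every weight matrix has shape (hidden, hidden + input)."""
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ hx)                                         # update gate
    r_t = sigmoid(W_r @ hx)                                         # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))    # candidate activation
    return (1 - z_t) * h_prev + z_t * h_tilde                       # single merged state

hidden, inp = 4, 3
zeros = np.zeros((hidden, hidden + inp))
h0 = np.array([1.0, -2.0, 0.5, 0.0])
# With all-zero weights z_t = 0.5 and the candidate is 0, so the state is halved.
h1 = gru_step(np.ones(inp), h0, zeros, zeros, zeros)
print(h1)
```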

Ref

Understanding LSTM Networks

Models and Architectures in Word2vec

Generally, word2vec is a language model that predicts the probability of a word based on its context. While building the model, it creates a word embedding for each word, and these word embeddings are widely used in many NLP tasks.

Models

CBOW (Continuous Bag of Words)

Use the context to predict the probability of the current word. (In the picture, the words are one-hot encoded, $W_{V*N}$ is the word embedding matrix, and $W_{V*N}^{'}$, the output weight matrix of the hidden layer, holds the output vectors $u$ used in the following equations.)

Context words' vectors are $υ_{c-m} … υ_{c+m}$ ($m$ is the window size)
Context vector $\hat{υ}=\frac{υ_{c-m}+υ_{c-m+1}+…+υ_{c+m}}{2m}$
Score vector $z_i = u_i^T\hat{υ}$, where $u_i$ is the output vector representation of word $ω_i$
Turn scores into probabilities $\hat{y}=softmax(z)$
We want the probabilities $\hat{y}$ to match the true probabilities $y$.

We use cross entropy $H(\hat{y},y)$ to measure the distance between these two distributions. $$H(\hat{y},y)=-∑_{j=1}^{\lvert V \rvert}{y_jlog(\hat{y}_j)}$$

Since $y$ is a one-hot vector, the loss simplifies to: $$H(\hat{y},y)=-y_c\log(\hat{y}_c)$$ where $c$ is the index of the correct word.

For perfect prediction, $H(\hat{y},y)=-1log(1)=0$

According to this, we can create this loss function:

$$\begin{aligned} minimize\ J &=-\log P(ω_c\lvert ω_{c-m},…,ω_{c-1},ω_{c+1},…,ω_{c+m}) \\ &= -\log P(u_c \lvert \hat{υ}) \\ &= -\log \frac{\exp(u_c^T\hat{υ})}{∑_{j=1}^{\lvert V \rvert}\exp (u_j^T\hat{υ})} \\ &= -u_c^T\hat{υ}+\log ∑_{j=1}^{\lvert V \rvert}\exp (u_j^T\hat{υ}) \end{aligned}$$
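As a sanity check of the last line of the derivation, the CBOW loss for one training example can be computed like this. The matrix names `V_in` and `U_out` (input and output embeddings, one row per word) and the toy sizes are assumptions for the example.

```python
import numpy as np

def cbow_loss(V_in, U_out, context_ids, center_id):
    """V_in, U_out: (|V|, N) input and output embedding matrices."""
    v_hat = V_in[context_ids].mean(axis=0)    # average of the context vectors
    z = U_out @ v_hat                         # scores u_j^T v_hat for every word j
    # J = -u_c^T v_hat + log sum_j exp(u_j^T v_hat)
    return -z[center_id] + np.log(np.exp(z).sum())

# With all-zero embeddings every word is equally likely, so J = log|V|.
loss = cbow_loss(np.zeros((5, 3)), np.zeros((5, 3)), [0, 1, 3, 4], 2)
print(loss, np.log(5))
```

Note that the second term sums over the whole vocabulary, which is what makes the exact loss expensive to evaluate.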

Skip-Gram

Use the current word to predict its context.

We get the input word's vector $υ_c$
Generate $2m$ score vectors, $u_{c-m},…,u_{c-1},u_{c+1},…,u_{c+m}$.
Turn scores into probabilities $\hat{y}=softmax(u)$
We want the probabilities $\hat{y}$ to match the true probabilities $y$.

$$\begin{aligned} minimize\ J &=-\log P(ω_{c-m},…,ω_{c-1},ω_{c+1},…,ω_{c+m}\lvert ω_c) \\ &=-\log ∏_{j=0,j≠ m}^{2m}P(ω_{c-m+j}\lvert ω_c)\\ &=-\log ∏_{j=0,j≠ m}^{2m}P(u_{c-m+j}\lvert υ_c)\\ &=-\log ∏_{j=0,j≠ m}^{2m}\frac{\exp (u^T_{c-m+j}υ_c)}{∑_{k=1}^{\lvert V \rvert}{\exp (u^T_k υ_c)}}\\ &=-∑_{j=0,j≠ m}^{2m}u^T_{c-m+j}υ_c+2m\log ∑_{k=1}^{\lvert V \rvert} \exp(u^T_k υ_c) \end{aligned}$$

Architectures

Minimizing $J$ is expensive, because you need to calculate the probability of every word in the vocabulary. There are two ways to reduce the computation: Hierarchical Softmax and Negative Sampling.

Hierarchical Softmax

Encode the words into a Huffman tree, so that each word has a Huffman code. The probability $P(ω\lvert Context(ω))$ then becomes the probability of choosing the path from the root to the word's leaf node, where each inner node is a binary classification. Suppose code $0$ is the positive label and $1$ is the negative label. If the probability of a positive classification is $$σ(X^T_ω θ)=\frac{1}{1+e^{-X^T_ω θ}}$$

then the probability of a negative classification is $$1-σ(X^T_ω θ)$$

For example, if the Huffman code of the word 足球 ("football") is $1001$, then its probabilities at the nodes along the path are

$$\begin{aligned} p(d_2^ω\lvert X_ω,θ^ω_1)&=1-σ(X^T_ω θ^ω_1)\\ p(d^ω_3\lvert X_ω,θ^ω_2)&=σ(X^T_ω θ^ω_2)\\ p(d^ω_4\lvert X_ω,θ^ω_3)&=σ(X^T_ω θ^ω_3)\\ p(d^ω_5\lvert X_ω,θ^ω_4)&=1-σ(X^T_ω θ^ω_4) \end{aligned}$$

where $θ^ω_j$ is the parameter of the $j$-th node on the path.

The probability of 足球 is the product of these four probabilities.

Generally,

$$p(ω\lvert Context(ω))=∏_{j=2}^{l^ω}p(d^ω_j\lvert X_ω,θ^ω_{j-1})$$

where $l^ω$ is the number of nodes on the path from the root to $ω$.

This reduces the computational complexity from $O(n)$ to $O(\log n)$, where $n$ is the vocabulary size.
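The product along the path can be written as a short loop. This is only a sketch: the function name and the way the node parameters are passed in (one vector per inner node on the path) are assumptions, and the code follows the convention above that code $0$ is the positive label.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def path_probability(code, x, thetas):
    """code: Huffman code such as '1001'; thetas: one vector per inner node on the path."""
    p = 1.0
    for bit, theta in zip(code, thetas):
        s = sigmoid(sum(xi * ti for xi, ti in zip(x, theta)))
        p *= s if bit == "0" else 1.0 - s    # 0 = positive, 1 = negative
    return p

# With all-zero node parameters every decision is a coin flip: 0.5 ** 4.
p = path_probability("1001", [1.0, 2.0], [[0.0, 0.0]] * 4)
print(p)
```

Only the nodes on one root-to-leaf path are evaluated, instead of one score per vocabulary word.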

Negative Sampling

This method chooses some negative samples and adds the probability of the negative words to the loss function. The optimisation target becomes maximising the probability of the positive words and minimising the probability of the negative words.

Let $P(D=0 \lvert ω,c)$ be the probability that $(ω,c)$ did not come from the corpus data. Then the objective function will be

$$θ = \underset{θ}{\text{argmax}} ∏_{(ω,c)∈ D} P(D=1\lvert ω,c,θ) ∏_{(ω,c)∈ \tilde{D}} P(D=0\lvert ω,c,θ)$$

where $θ$ denotes the parameters of the model ($υ$ and $u$) and $\tilde{D}$ is the set of negative samples.

—

update 04-04-20

I found these two articles pretty useful: Language Models, Word2Vec, and Efficient Softmax Approximations and Word2vec from Scratch with NumPy.

Ref

[word2vec 原理推导与代码分析](http://www.hankcs.com/nlp/word2vec.html)
[CS 224D: Deep Learning for NLP Lecture Notes: Part I](http://cs224d.stanford.edu/lecture_notes/notes1.pdf)
[word2vec 中的数学原理详解（一）目录和前言](http://blog.csdn.net/itplus/article/details/37969519)

Parameters in doc2vec

Here are some of the parameters of gensim's doc2vec class.

window

window is the maximum distance between the predicted word and context words used for prediction within a document. It will look behind and ahead.

In the skip-gram model, if the window size is 2, the training samples look like this (the blue word is the input word):

min_count

If a word appears fewer times than this value, it is skipped.

sample

High-frequency words like "the" are of little use for training. sample is a threshold for dropping these high-frequency words. The probability of keeping the word $w_i$ is:

$$P(w_i) = \left(\sqrt{\frac{z(w_i)}{s}} + 1\right) ⋅ \frac{s}{z(w_i)}$$

where $z(w_i)$ is the relative frequency of the word in the corpus and $s$ is the sample rate.
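Plugging a few values into the formula shows how quickly frequent words get dropped. The helper below is just the formula above written out; it is not taken from gensim's source.

```python
import math

def keep_probability(z, s=1e-3):
    """z: fraction of the corpus made up by the word; s: the `sample` threshold."""
    return (math.sqrt(z / s) + 1) * s / z

for z in (0.0001, 0.001, 0.01, 0.1):
    print(z, round(keep_probability(z), 4))
```

For rare words the value is above 1, which means they are always kept; a word that makes up 1% of the corpus is kept only about 42% of the time when `sample` is 1e-3.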

This is the plot when sample is 1e-3.

negative

Usually, when training a neural network, all of the weights in the network need to be tweaked for each training sample. For example, if the word pair is ('fox', 'quick'), then only the output neuron for the word quick should output 1, and all of the other word neurons should output 0.

But this would take a lot of time when we have billions of training samples. So, instead of updating all of the weights, we randomly choose a small number of "negative" words (the default is 5) and update only their weights so that they output 0.

So when dealing with the word pair ('fox', 'quick'), we update quick's weights to output 1, and the weights of 5 other random words to output 0.

The probability of selecting word $ω_i$ is $P(ω_i)$:

$$P(ω_i)=\frac{{f(ω_i)}^{{3}/{4}}}{∑_{j=0}^{n}\left({f(ω_j)}^{{3}/{4}}\right)}$$

$f(ω_j)$ is the frequency of word $ω_j$.
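A tiny example shows the effect of the 3/4 power: it flattens the distribution, so rare words are sampled as negatives more often than their raw frequency suggests. The word counts are made up for illustration.

```python
def negative_sampling_probs(counts):
    """counts: dict mapping each word to its frequency f(w)."""
    weights = {w: c ** 0.75 for w, c in counts.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

probs = negative_sampling_probs({"the": 16, "fox": 1})
print(probs)
```

Here "the" has a raw frequency share of 16/17 (about 0.94), but its sampling probability is 8/9 (about 0.89), while the share of "fox" rises from 1/17 to 1/9.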

Ref

[Word2Vec Tutorial - The Skip-Gram Model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
[Word2Vec Tutorial Part 2 - Negative Sampling](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/)

Semi-supervised text classification using doc2vec and label spreading

Here is a simple way to classify text without much human effort and still get impressive performance.

It can be divided into two steps:

Get training data by using keyword classification
Generate a more accurate classification model by using doc2vec and label spreading

Keyword-based Classification

Keyword-based classification is a simple but effective method, but extracting the target keywords by hand is monotonous work. I use the following method to extract keyword candidates automatically.

Pick a few of the most common words to classify the text.
Use this equation to calculate the score of each word that appears in the text: $$ score(i) = \frac{count(i)}{all\_count(i)^{0.3}}$$ where $all\_count(i)$ is the count of word $i$ in the whole corpus, and $count(i)$ is the count of word $i$ in the positive corpus.
Check the top-scoring words and add the suitable ones to the final keyword list. Repeat this process.
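The scoring step can be sketched as follows. Whitespace tokenization, the function name and the toy documents are assumptions for the example; the positive documents are assumed to be a subset of the whole corpus.

```python
from collections import Counter

def keyword_scores(positive_docs, all_docs, power=0.3):
    """score(i) = count(i) / all_count(i) ** power for every word in the positive docs."""
    count = Counter(w for doc in positive_docs for w in doc.split())
    all_count = Counter(w for doc in all_docs for w in doc.split())
    return {w: c / all_count[w] ** power for w, c in count.items()}

all_docs = ["cheap flight deal", "the flight was late",
            "the weather is nice", "the cat sat"]
scores = keyword_scores(all_docs[:2], all_docs)
print(sorted(scores, key=scores.get, reverse=True)[:3])
```

Dividing by a power of the overall count pushes down words such as "the" that are common everywhere, while words concentrated in the positive documents rise to the top.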

Finally, we can use the keywords to classify the text and get the training data.

Classification by doc2vec and Label Spreading

Keyword-based classification sometimes produces the wrong result, as it can't use the semantic information in the text. Fortunately, Google has open-sourced word2vec, which can be used to produce semantically meaningful word embeddings. Furthermore, sentences can also be converted to vectors by using doc2vec. Sentences with close meanings also have a short vector distance.

So the problem is how to classify these vectors.

Use the corpus to train the doc2vec model.
Use the doc2vec model to convert each sentence into a vector.
Use the label spreading algorithm to train a classifier for the vectors.

TextCNN with PyTorch and Torchtext on Colab

PyTorch is a really powerful framework for building machine learning models. Although some features are missing when compared with TensorFlow (for example, the early stopping function, or History for drawing plots), its code style is more intuitive.

Torchtext is an NLP package which is also made by the PyTorch team. It provides a way to read, process and iterate over text.

Google Colab is a Jupyter notebook environment hosted by Google; you can use a free GPU or TPU to run your model.

Here is a simple tutorial to build a TextCNN model and run it on Colab.

The TextCNN paper was published by Kim in 2014. The model's idea is pretty simple, but the performance is impressive. If you are trying to solve a text classification problem, this model is a good choice to start with.

The main architecture is shown below:

It uses kernels of different sizes to extract text features, then uses softmax regression to classify the text based on these features.

Now we can build this model step by step.

First, build the model. The model I use is CNN-multichannel, which contains two sets of word embeddings. Both of them are copies of the word embeddings generated from the corpus, but only one set is updated during training.

The code is below:

import torch
import torch.nn as nn
import torch.nn.functional as F


class textCNNMulti(nn.Module):
    def __init__(self,args):
        super().__init__()
        dim = args['dim']
        n_class = args['n_class']
        embedding_matrix=args['embedding_matrix']
        kernels=[3,4,5]
        kernel_number=[150,150,150]
        # Two copies of the pretrained embeddings: one frozen, one fine-tuned during training
        self.static_embed = nn.Embedding.from_pretrained(embedding_matrix)
        self.non_static_embed = nn.Embedding.from_pretrained(embedding_matrix, freeze=False)
        # One convolution per kernel size; the two embedding copies are the two input channels
        self.convs = nn.ModuleList([nn.Conv2d(2, number, (size, dim),padding=(size-1,0)) for (size,number) in zip(kernels,kernel_number)])
        self.dropout=nn.Dropout()
        self.out = nn.Linear(sum(kernel_number), n_class)

    def forward(self, x):
        non_static_input = self.non_static_embed(x)
        static_input = self.static_embed(x)
        x = torch.stack([non_static_input, static_input], dim=1)  # (batch, 2, seq_len, dim)
        x = [F.relu(conv(x)).squeeze(3) for conv in self.convs]  # each: (batch, number, seq_len + size - 1)
        x = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in x]  # max over time, each: (batch, number)
        x = torch.cat(x, 1)
        x = self.dropout(x)
        x = self.out(x)  # class logits; the softmax is applied inside nn.CrossEntropyLoss
        return x

Second, convert the text into word indices, so that each sentence becomes a vector for training.

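# Legacy torchtext API (Field, BucketIterator); later releases moved it to torchtext.legacy and then removed it
from torchtext import data, datasets

TEXT = data.Field(lower=True,batch_first=True)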
LABEL = data.LabelField()

train, val, test = datasets.SST.splits(TEXT, LABEL, 'data/',fine_grained=True)

TEXT.build_vocab(train, vectors="glove.840B.300d")
LABEL.build_vocab(train,val,test)

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_sizes=(128, 256, 256),shuffle=True)

Field defines how to process the text. Here are the most common parameters:

sequential – Whether the datatype represents sequential data. If False, no tokenization is applied. Default: True.

use_vocab – Whether to use a Vocab object. If False, the data in this field should already be numerical. Default: True.

preprocessing – The Pipeline that will be applied to examples using this field after tokenizing but before numericalizing. Many Datasets replace this attribute with a custom preprocessor. Default: None.

batch_first – Whether to produce tensors with the batch dimension first. Default: False.

datasets.SST.splits loads the SST dataset and splits it into train, validation and test Dataset objects.

build_vocab creates the Vocab object for the Field, which contains the information needed to convert words into word indices and vice versa. The word embeddings are also saved as Field.vocab.vectors, which contains all of the word embeddings. Torchtext can download some pretrained vectors automatically, such as glove.840B.300d and fasttext.en.300d. You can also load your own vectors in this way; xxx.vec should be in the standard word2vec format.

from torchtext.vocab import Vectors

vectors = Vectors(name='xxx.vec', cache='./')
TEXT.build_vocab(train, val, test, vectors=vectors)

data.BucketIterator.splits returns iterators that load batches of data from the datasets, and the texts in the same batch have similar lengths.

Now we can start to train the model. First we wrap some parameters into args, which contains settings like the number of output classes, learning rate, log interval and so on.

args={}
args['vocb_size']=len(TEXT.vocab)
args['dim']=300
args['n_class']=len(LABEL.vocab)
args['embedding_matrix']=TEXT.vocab.vectors
args['lr']=0.001
args['momentum']=0.8
args['epochs']=180
args['log_interval']=100
args['test_interval']=500
args['save_dir']='./'

Finally, we can train the model.

model=textCNNMulti(args)
model.cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=args['lr'],momentum=args['momentum'])
criterion = nn.CrossEntropyLoss()
steps=0
for epoch in range(1, args['epochs']+1):
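    for i,batch in enumerate(train_iter):  # named `batch` so the torchtext `data` module is not shadowed
        steps+=1

        x, target = batch.text, batch.label
        x=x.cuda()

        # Shift the label indices down by one; this assumes they start at 1
        target.sub_(1)
        target=target.cuda()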

        output = model(x)
        loss = criterion(output, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

You can find textcnn.ipynb on GitHub or Colab.


Using Dueling DQN to Play Flappy Bird

PyTorch provides a simple DQN implementation to solve the cartpole game. However, the code is flawed: training diverges (this has been discussed here).

The training curve of the official code is below; its high score is about 50 and it finally diverges.

There are several reasons that lead to the divergence.

First, the tutorial uses the difference of two frames as input. Not only does this lose the cart's absolute position (this information is useful, as the game terminates if the cart moves too far from the centre), it also confuses the agent when the difference is the same but the states differ.

Second, the replay memory is small. If the memory is too small, the agent will forget the strategy it has taken in some states. I'm not sure whether a memory of 10000 is big enough, but I suggest using a higher value.

Third, the parameters. learning_rate and target_update_interval may cause fluctuation. Here is an example on Stack Overflow. I also met this problem when training the cartpole agent: the reward stopped growing after 1000 episodes.

After doing some research on the cartpole DQN code, I managed to make a model that plays Flappy Bird. Here are the changes from the original cartpole code. Most of the techniques can be found in these two papers: Playing Atari with Deep Reinforcement Learning and Rainbow: Combining Improvements in Deep Reinforcement Learning.

Here is the model architecture:

Here is a trained result:

Dueling DQN
The vanilla DQN suffers from overestimation, because the max function accumulates noise during training. This leads to convergence at a suboptimal point. The following two architectures were proposed to solve this problem.

$$ Q(s, a) = r + γ \max_{a'}Q(s', a') $$

Double DQN was published two years after DQN. It has two value functions: one is used to choose the action with the max Q value, and the other is used to calculate the Q value of this action.

$$ a^{max}(S'_j, w) = \arg\max_{a'}Q(φ(S'_j),a',w) $$

$$ Q(s,a) = r + γ Q'(φ(S'_j),a^{max}(S'_j, w),w') $$

Dueling DQN is another solution. It has two estimators: one estimates the value of the current state, and the other estimates the advantage of each action.

$$Q(s, a) = r + γ \max_{a'}[A(s',a')+V(s')]$$

In order to make the action scores identifiable, the mean action score is subtracted when computing the Q-value:

x=val+adv-adv.mean(1,keepdim=True)
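To see what this line does, here is the same aggregation on made-up numbers, written with NumPy instead of PyTorch so that it runs anywhere. The values of `val` and `adv` are invented for the example.

```python
import numpy as np

val = np.array([[1.0], [2.0]])                 # V(s), shape (batch, 1)
adv = np.array([[0.5, -0.5], [1.0, 3.0]])      # A(s, a), shape (batch, n_actions)
q = val + adv - adv.mean(axis=1, keepdims=True)
print(q)   # rows: [1.5, 0.5] and [1.0, 3.0]
```

After the subtraction, the mean of each row of `q` equals the state value, so the value stream and the advantage stream can no longer shift against each other by a constant.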

In this project, I use dueling DQN.

Image processing
I grayscale and crop the image.
Stack frames
I use the last 4 frames as the input. This should help the agent perceive changes in the environment.
Extra FC before last layer
I add an FC layer between the image features and the FC layer that calculates the Q-value.
Frame Skipping
Frame skipping means the agent sees and selects an action on every $k$-th frame instead of every frame; the last action is repeated on the skipped frames. This accelerates training. In this project, I use frame_skipping=2, because the more frames are skipped, the more likely the bird is to hit a pipe. This method did help the agent converge faster. More details can be found in this post.
Prioritized Experience Replay
This idea was published here. It's a very simple idea: replay experiences with high TD error more frequently. My implementation is not efficient, but in the cartpole game this technique helped the agent converge faster.
Colab and Kaggle Kernel
My MacBook doesn't support CUDA, so I use these two websites to train the model. Here is a comparison of them. During training, Kaggle seems more stable; Colab usually disconnected after 1h.

| | Colab | Kaggle Kernel |
| --- | --- | --- |
| GPU | Tesla T4 (16G) | Tesla P100 (16G) |
| RAM | 13G | 13G |
| Max training time | 12h | 9h |
| Export trained model | Google Drive | - |
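Coming back to the frame skipping item in the list above, the idea can be sketched as a thin wrapper around the game environment. Both classes are hypothetical: `SkipFrames` assumes an environment whose `step(action)` returns `(observation, reward, done)`, and `CountingEnv` is a toy stand-in for the game, not the Flappy Bird environment used in the project.

```python
class SkipFrames:
    """Repeat each chosen action for k frames and sum the rewards."""

    def __init__(self, env, k=2):
        self.env = env
        self.k = k

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.k):
            obs, reward, done = self.env.step(action)
            total_reward += reward
            if done:                 # stop repeating once the episode ends
                break
        return obs, total_reward, done


class CountingEnv:
    """Toy environment: one reward point per frame, episode ends after 5 frames."""

    def __init__(self):
        self.frame = 0

    def step(self, action):
        self.frame += 1
        return self.frame, 1.0, self.frame >= 5


env = SkipFrames(CountingEnv(), k=2)
print(env.step(0))   # (2, 2.0, False)
```

The agent now makes one decision per two frames, which halves the number of forward passes per episode.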

—

The lesson I learnt from this project is patience. It takes a long time (maybe hundreds of thousands of steps) to see whether the model works, and there are so many parameters that can affect the final performance. It took me about 3 weeks to build the final model. So if you want to build your own model, be patient and good luck. Here are two articles talking about debugging and hyperparameter tuning in DQN:

Here are some things that may help with this task.

TensorBoard
It's a visualization tool made by the TensorFlow team. It's more convenient than generating graphs manually with matplotlib. Besides reward and mean_q, these variables are also useful when debugging: TD-error, loss, action_distribution and avg_priority.
Advanced image pre-processing
In this project, I just grayscale the image. A more advanced technique such as binarization should help the agent filter out unimportant details of the game output.

In Flappy Bird RL, the author extracts the vertical distance from the lower pipe and the horizontal distance from the next pair of pipes as the state. The trained agent can achieve a score of 3000.

Other Improvements
Rainbow introduces many other extensions to enhance DQN; some of them have been discussed in this post.

I’ve uploaded code to this repo.

—

Update 26-04-19
Colab's GPU has been upgraded from K80 to Tesla T4, so it is now my best bet.
Update 07-05-19
TensorBoard is natively supported in PyTorch since version 1.1
Update 26-07-19
If you run out of RAM in Colab, it will show an option to double the RAM.
Update 13-08-19
Upload video, update code.

Ref

PyTorch REINFORCEMENT LEARNING (DQN) TUTORIAL
强化学习 (Reinforcement Learning; a series of Chinese posts about reinforcement learning)
Deep Reinforcement Learning for Flappy Bird
Flappy-Bird-Double-DQN-Pytorch
DeepRL-Tutorials
Speeding up DQN on PyTorch: how to solve Pong in 30 minutes
Frame Skipping and Pre-Processing for Deep Q-Networks on Atari 2600 Games
OpenAI Baselines: DQN
Deep-Reinforcement-Learning-Hands-On
DQN solution results peak at ~35 reward

Different types of Attention

$h_i$ are the source hidden states and $s_t$ is the target hidden state; each has shape (n,1). $c_t$ is the final context vector, and $α_{t,i}$ is the alignment score.

$$\begin{aligned} c_t&=∑_{i=1}^n α_{t,i}h_i \\ α_{t,i}&= \frac{\exp(score(s_t,h_i))}{∑_{j=1}^n \exp(score(s_t,h_j))} \end{aligned}$$

Global(Soft) VS Local(Hard)

Global attention takes all source hidden states into account, while local attention only uses part of the source hidden states.

Content-based VS Location-based

Content-based attention uses both the source hidden states and the target hidden state, while location-based attention only uses the target hidden state.

Here are several popular attention mechanisms:

Dot-Product

$$score(s_t,h_i)=s_t^Th_i$$

Scaled Dot-Product

$$score(s_t,h_i)=\frac{s_t^Th_i}{\sqrt{n}}$$ where $n$ is the dimension of the vectors. Google's Transformer model uses a similar scaling factor when calculating self-attention: $score=\frac{QK^T}{\sqrt{n}}$

Location-Based

$$score(s_t,h_i)=softmax(W_as_t)$$

General

$$score(s_t,h_i)=s_t^TW_ah_i$$

$W_a$'s shape is (n,n)

Concat

$$score(s_t,h_i)=v_a^Ttanh(W_a[s_t,h_i])$$

$v_a$'s shape is (x,1), and $W_a$'s shape is (x,x). This is similar to a neural network with one hidden layer.
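The scoring functions only differ in how the raw score is computed; the softmax and the weighted sum are shared. Below is a small NumPy sketch of the dot-product, scaled dot-product and general variants. The function signature and the toy vectors are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(s_t, H, score="dot", W_a=None):
    """s_t: (n,) target state; H: (num_source, n) source states, one per row."""
    if score == "dot":
        scores = H @ s_t
    elif score == "scaled_dot":
        scores = H @ s_t / np.sqrt(len(s_t))
    elif score == "general":
        scores = H @ (W_a.T @ s_t)      # s_t^T W_a h_i for every i
    else:
        raise ValueError(score)
    alpha = softmax(scores)             # alignment scores
    return alpha @ H, alpha             # context vector c_t and the weights

H = np.array([[1.0, 0.0], [0.0, 1.0]])
s_t = np.array([10.0, 0.0])
c_t, alpha = attention_context(s_t, H)
print(alpha.round(4), c_t.round(4))
```

The target state points in the direction of the first source state, so almost all of the weight goes to it and the context vector is close to that state.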

When I was working on a slot filling project, I compared these mechanisms. Concat attention produced the best result.


The Annotated The Annotated Transformer

Thanks to the articles I list at the end of this post, I understand how the Transformer works. These posts are comprehensive, but there are some points that confused me.

First, this is the graph that is referenced by almost every post related to the Transformer.

Transformer consists of these parts: Input, Encoder*N, Output Input, Decoder*N, Output. I’ll explain them step by step.

Input

Each input word is mapped to a 512-dimensional vector. Then a Positional Encoding (PE) is generated and added to the original embedding.

Positional Encoding

The Transformer model does not contain recurrence or convolution. In order to let the model capture the order of the input words, it adds the PE to the embeddings.

PE generates a 512-dimensional vector for each position:

$$\begin{aligned} PE_{(pos,2i)} &= \sin(pos / 10000^{2i/d_{model}})\\ PE_{(pos,2i+1)} &= \cos(pos / 10000^{2i/d_{model}}) \end{aligned}$$ The even and odd dimensions use the sin and cos functions respectively.

For example, the PE of the word at position 2 is: $\sin(2 / 10000^{0 / 512}), \cos(2 / 10000^{0 / 512}), \sin(2 / 10000^{2 / 512}), \cos(2 / 10000^{2 / 512})\text{…}$

The value range of PE is $[-1,1]$, and each position's PE is slightly different, as the sin and cos functions have a different frequency in each pair of dimensions. Also, for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.

For an even dimension, let $10000^{2i/d_{model}}$ be $α$; then:

$$\begin{aligned} PE_{pos+k}&=\sin((pos+k)/α) \\ &=\sin(pos/α)\cos(k/α)+\cos(pos/α)\sin(k/α)\\ &=PE_{pos\_even}K_1+PE_{pos\_odd}K_2 \end{aligned}$$

where $K_1=\cos(k/α)$ and $K_2=\sin(k/α)$ depend only on the offset $k$.
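Both the encoding and the linear-offset property are easy to check numerically. The sketch below uses the interleaved sin/cos layout from the formula above; the function name and the chosen positions are assumptions for the example.

```python
import math

def positional_encoding(pos, d_model=512):
    pe = []
    for two_i in range(0, d_model, 2):               # two_i plays the role of 2i
        angle = pos / 10000 ** (two_i / d_model)
        pe += [math.sin(angle), math.cos(angle)]     # even dimension: sin, odd: cos
    return pe

pos, k = 2, 3
pe_pos = positional_encoding(pos)
pe_shifted = positional_encoding(pos + k)
for two_i in (0, 2, 510):
    alpha = 10000 ** (two_i / 512)
    k1, k2 = math.cos(k / alpha), math.sin(k / alpha)
    # PE_{pos+k} (even dim) = PE_pos (even dim) * K_1 + PE_pos (odd dim) * K_2
    combined = pe_pos[two_i] * k1 + pe_pos[two_i + 1] * k2
    assert math.isclose(pe_shifted[two_i], combined, abs_tol=1e-12)
print(pe_pos[:4])
```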

The PE implementation in tensor2tensor uses sin in the first half of the dimensions and cos in the second half.

Encoder

There are 6 Encoder layers in the Transformer. Each layer consists of two sub-layers: Multi-Head Attention and a Feed Forward Neural Network.

Multi-Head Attention

Let's begin with single-head attention. In short, it maps the word embeddings to q, k and v vectors and uses them to calculate the attention.

The input words are mapped to q, k and v by multiplying them with the Query, Key and Value matrices. Then, for a given query, the attention over each word in the sentence is calculated by this formula: $\mathrm{attention}=\mathrm{softmax}(\frac{qk^T}{\sqrt{d_k}})v$, where q, k and v are 64-dimensional vectors.

Matrix view:

$Attention(Q, K, V) = \mathrm{softmax}(\frac{(XW^Q)(XW^K)^T}{\sqrt{d_k}})(XW^V)$ where $X$ is the input embedding.

Single-head attention only outputs a 64-dimensional vector, but the input dimension is 512. How do we transform back to 512? That's why the Transformer has multi-head attention.

Each head has its own $W^Q$, $W^K$ and $W^V$ matrices (each of shape (512, 64)) and produces one of $Z_0,Z_1…Z_7$ (for a sentence of $n$ words, each $Z_i$ has shape (n, 64)). The outputs are then concatenated as $O$. $O$ is multiplied by a weight matrix $W^O$ ($W^O$'s shape is (512, 512)) and the result is $Z$, which is sent to the Feed Forward Network.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
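The shapes are easier to follow in code. The following NumPy sketch runs 8 heads of size 64 over a 5-word input with random weights; it leaves out masking, dropout and biases, and all names and the weight scale are assumptions for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, W_o):
    """X: (n_words, 512); heads: list of (W_q, W_k, W_v), each (512, 64)."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v             # each (n_words, 64)
        weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (n_words, n_words)
        outputs.append(weights @ V)                     # Z_i: (n_words, 64)
    O = np.concatenate(outputs, axis=-1)                # (n_words, 8 * 64 = 512)
    return O @ W_o                                      # Z: (n_words, 512)

rng = np.random.default_rng(0)
n_words, d_model, n_heads = 5, 512, 8
d_k = d_model // n_heads
X = rng.normal(size=(n_words, d_model))
heads = [tuple(rng.normal(scale=0.02, size=(d_model, d_k)) for _ in range(3))
         for _ in range(n_heads)]
W_o = rng.normal(scale=0.02, size=(d_model, d_model))
Z = multi_head_attention(X, heads, W_o)
print(Z.shape)   # (5, 512)
```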

The whole procedure looks like this:

Add & Norm

This layer works like this line of code: norm(x+dropout(sublayer(x))) or x+dropout(sublayer(norm(x))). The sublayer is Multi-Head Attention or FF Network.

Layer Normalization

Layer Norm is similar to Batch Normalization, but it normalizes across all the features of a layer for each sample rather than normalizing each feature across the batch. (Scale and shift are still applied per feature.) More details can be found in this paper.

Position-wise Feed Forward Network

This layer is a Neural Network whose size is (512, 2048, 512). The exact same feed-forward network is independently applied to each position.

Output Input

Same as Input.

Decoder

The Decoder is pretty similar to the Encoder. It also has 6 layers, but each Decoder layer has 3 sub-layers: a masked multi-head attention is added at the beginning.

Masked Multi-Head Attention

This layer is used to block future words during training. For example, if the output is <bos> hello world <eos>, we first use <bos> as input to predict hello, and hello world <eos> is masked out so that its attention weights are 0.
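In practice the mask is usually applied to the attention scores before the softmax: future positions get a score of negative infinity, which turns into a weight of exactly 0. The sketch below shows this with NumPy on uniform scores; the helper names are assumptions for the example.

```python
import numpy as np

def causal_mask(n):
    # True where position i may attend to position j, i.e. j <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -np.inf)     # block the future positions
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Four tokens (<bos> hello world <eos>) with uniform raw scores.
weights = masked_softmax(np.zeros((4, 4)), causal_mask(4))
print(weights)
```

The first row only attends to the first token, the second row splits its weight over the first two tokens, and so on.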

Key and Value in Decoder Multi-Head Attention Layer

In the Encoder, the q, k and v vectors are generated by $XW^Q$, $XW^K$ and $XW^V$. In the second sub-layer of the Decoder, they are generated by $XW^Q$, $YW^K$ and $YW^V$, where $Y$ is the Encoder's output and $X$ is the <init of sentence> token or the previous output.

The animation below illustrates how to apply the Transformer to machine translation.

Output

A linear layer is used to predict the output.


Near-duplicate with SimHash

Before talking about SimHash, let’s review some other methods which can also identify duplication.

Longest Common Subsequence(LCS)

This is the algorithm used by the diff command. It is also the edit distance with insertion and deletion as the only two edit operations.

This works well for short strings. However, the algorithm's time complexity is $O(m*n)$ for two strings of lengths $m$ and $n$, so it is not suitable for a large corpus. Also, if two documents consist of the same paragraphs in a different order, LCS treats them as different documents, which is not what we expect.

Bag of Words(BoW)

Transform each document into the set of words it contains, then use Jaccard similarity to calculate the similarity.

For example, if document A contains {a,b,c} and B contains {a,b,d}, then $$Similarity = \frac{\lvert A ∩ B \rvert}{\lvert A ∪ B \rvert} = \frac{\lvert\{a,b\}\rvert}{\lvert\{a,b,c,d\}\rvert}=\frac{2}{4}=\frac{1}{2}$$
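The same example in code, as a minimal sketch:

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(jaccard({"a", "b", "c"}, {"a", "b", "d"}))   # 0.5
```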

Shingling (n-gram)

BoW drops the word context information. In order to take word context into consideration, we convert sentences into phrases. For instance, roses are red and violets are blue is converted to roses are red, are red and, red and violets …

Hashing

Saving the shingling result takes k times the disk space if k-word phrases are used. To solve this problem, save each phrase's hash value instead of the string.
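Shingling and hashing can be sketched together. Using the first 8 bytes of an MD5 digest as a 64-bit hash is an arbitrary choice for this example; any stable hash function would do.

```python
import hashlib

def shingles(text, k=3):
    """All phrases of k consecutive words, in order."""
    words = text.split()
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]

def hashed_shingles(text, k=3):
    # Store a fixed-size 64-bit integer per phrase instead of the phrase itself.
    return {int.from_bytes(hashlib.md5(s.encode("utf-8")).digest()[:8], "big")
            for s in shingles(text, k)}

phrases = shingles("roses are red and violets are blue")
print(phrases[:3])   # ['roses are red', 'are red and', 'red and violets']
print(len(hashed_shingles("roses are red and violets are blue")))
```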

MinHash

The larger the document is, the more hashes need to be compared. Is there a way to map documents to a constant-size value? MinHash tackles this problem.

It uses $k$ hash functions to calculate the phrase hashes. Then, for each hash function, it uses the minimal hash result as the signature. Finally, we get $k$ hash values as the document's signature. The procedure is shown below.

Compared with Hashing, MinHash reduces both the time complexity and the storage complexity of a comparison to $O(1)$, an improvement over $O(m+n)$ and $O(n)$ respectively, where $n$ is the number of phrases in the document and $m$ is the number of phrases in the document to compare with.
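A minimal MinHash sketch follows. The $k$ hash functions are simulated by salting MD5 with a seed, which is an assumption made for brevity; the fraction of matching signature positions then estimates the Jaccard similarity of the two phrase sets.

```python
import hashlib

def _hash(token, seed):
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")

def minhash_signature(phrases, k=64):
    """For each of the k hash functions keep only the minimal hash value."""
    return [min(_hash(p, seed) for p in phrases) for seed in range(k)]

def estimated_similarity(sig_a, sig_b):
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

sig_a = minhash_signature({"roses are red", "are red and", "red and violets"})
sig_b = minhash_signature({"roses are red", "are red and", "red and tulips"})
print(len(sig_a), estimated_similarity(sig_a, sig_b))
```

Whatever the size of the document, its signature has exactly $k$ values, and comparing two documents costs $k$ equality checks.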

SimHash

For a given document, how do we find its most similar document? With MinHash, we need to traverse the whole corpus. Is there a more efficient method? SimHash comes to the rescue.

For a set of input hashes, SimHash generates a fingerprint (an f-bit vector). The fingerprints have a useful property: similar inputs generate similar fingerprints. So the dissimilarity of two documents can be calculated from the XOR of their two fingerprints. In Google's Detecting Near-Duplicates for Web Crawling paper, they map 8B web pages to 64-bit fingerprints. If two fingerprints differ in at most 3 bits, the two web pages are considered similar.

The calculation of SimHash is quite simple. Given a set of features extracted from the document and their weights, we maintain an f-dimensional vector $V$, initialized to zero. Each feature is hashed to an f-bit value $V_i$ with weight $W_i$. If the j-th bit of $V_i$ is 1, the j-th dimension of $V$ is incremented by the weight of that feature; otherwise it is decremented by the weight. When all features have been processed, $V$ contains positive and negative dimensions. Mapping positive values to 1 and negative values to 0 gives the final hash value.

$$V = zero\_or\_one(∑{W_i*inc\_or\_dec(V_i)})$$

How to generate features from document

One easy way to do this is to slide a window over the document to get sub-strings. For each sub-string, use its hash value as the feature and its count as the weight.

For example, take the sentence: kk really rocks!.

First, pre-process the sentence to kkreallyrocks.

Then use a window of 4 to generate sub-strings from the sentence. We get the sub-strings and their counts: (kkre, 1), (krea, 1), (real, 1) etc.

Suppose we only take these first 3 sub-strings and their hash values are 1001, 0101 and 1101 respectively. Then the per-bit sums are (1, 1, -3, 3), so the final $V$ is 1101.
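The worked example can be reproduced with a few lines of code. The function below is a sketch that takes the feature hashes as given (the 4-bit values from the example) rather than computing them from the sub-strings.

```python
def simhash(features, f=4):
    """features: iterable of (hash_value, weight) pairs; hash_value is an f-bit int."""
    v = [0] * f
    for h, w in features:
        for i in range(f):
            bit = (h >> (f - 1 - i)) & 1          # i-th bit, most significant first
            v[i] += w if bit else -w
    # Positive dimensions become 1, the rest become 0.
    return sum(1 << (f - 1 - i) for i in range(f) if v[i] > 0)

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

fp = simhash([(0b1001, 1), (0b0101, 1), (0b1101, 1)])
print(format(fp, "04b"))                 # 1101
print(hamming_distance(fp, 0b1001))      # 1
```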

How to find similar document

Iterating over all documents and comparing each one with the target simhash value is a time-consuming operation. Is there a smarter way to accomplish this task? In Google's paper, they published a very neat algorithm.

Suppose the hash value is a 64-bit vector and we want to find the documents that differ from the target in at most 2 bits. We can divide the vector into 4 parts: $A$, $B$, $C$ and $D$. Since 2 differing bits can touch at most 2 parts, at least two parts must be identical.

Suppose parts $A$ and $B$ are identical. If we have sorted the hashes in $ABCD$ order, we can easily find all hashes whose $AB$ part is identical. Then we compare the remaining parts $C$ and $D$ and keep the hash vectors that differ from the target in at most 2 bits. If you have 8B ($2^{34}$) documents and the fingerprints are distributed uniformly at random, on average you only need to compare $2^{34-32}=4$ fingerprints.

Besides $AB$, the identical parts may also be $AC$, $AD$, $BC$, $BD$ or $CD$. So you need to keep $C_4^2=6$ sorted lists and compare about 4 fingerprints in each list. You don't need to compare 8B documents anymore, which is a great improvement.

Depending on the fingerprints’ bit and documents number, you need to find a optimal number to split the hash value.
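The lookup idea can be sketched in a few lines of Python. Note that this sketch swaps the six sorted lists for six hash tables keyed on a pair of parts, which is easier to show in memory; the class name FingerprintIndex and its methods are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

BITS, PARTS = 64, 4
PART_BITS = BITS // PARTS  # 16 bits per part


def parts_of(fp: int):
    """Split a 64-bit fingerprint into parts A, B, C, D (most significant first)."""
    mask = (1 << PART_BITS) - 1
    return [(fp >> (PART_BITS * (PARTS - 1 - i))) & mask for i in range(PARTS)]


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


class FingerprintIndex:
    """One lookup table per pair of parts (AB, AC, AD, BC, BD, CD)."""

    def __init__(self):
        self.pairs = list(combinations(range(PARTS), 2))
        self.tables = [defaultdict(list) for _ in self.pairs]

    def add(self, fp: int) -> None:
        p = parts_of(fp)
        for table, (i, j) in zip(self.tables, self.pairs):
            table[(p[i], p[j])].append(fp)

    def near(self, target: int, k: int = 2):
        """Return stored fingerprints within Hamming distance k (valid for k <= 2)."""
        p = parts_of(target)
        found = set()
        for table, (i, j) in zip(self.tables, self.pairs):
            # Only fingerprints sharing two whole parts are ever compared.
            for candidate in table.get((p[i], p[j]), ()):
                if hamming(candidate, target) <= k:
                    found.add(candidate)
        return found
```

For example, after adding a fingerprint that differs from the target in one bit of part B and one bit of part D, near(target) finds it through the AC table without scanning anything else.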


Programming

Create Node Benchmark in Py2neo

Recently, I've been working on a Neo4j project. I use Py2neo to interact with the graph database. Although Py2neo is very Pythonic and easy to use, its performance is really poor. Sometimes I have to write Cypher statements by hand when I can't bear the slow execution. Here is a small script that I use to compare the performance of 4 different ways to insert nodes.

import time

from graph_db import graph

from py2neo.data import Node, Subgraph


def delete_label(label):
    graph.run('MATCH (n:{}) DETACH DELETE n'.format(label))


def delete_all():
    print('delete all')
    graph.run('match (n) detach delete n')


def count_label(label):
    return len(graph.nodes.match(label))


def bench_create1():
    print('Using py2neo one by one')
    delete_label('test')
    start = time.time()
    tx = graph.begin()
    for i in range(100000):
        n = Node('test', id=i)
        tx.create(n)
    tx.commit()
    print(time.time() - start)
    print(count_label('test'))
    delete_label('test')


def bench_create2():
    print('Using cypher one by one')
    delete_label('test')
    start = time.time()
    tx = graph.begin()
    for i in range(100000):
        tx.run('create (n:test {id: $id})', id=i)
        if i and i % 1000 == 0:
            tx.commit()
            tx = graph.begin()
    tx.commit()
    print(time.time() - start)
    print(count_label('test'))
    delete_label('test')


def bench_create3():
    print('Using Subgraph')
    delete_label('test')
    start = time.time()
    tx = graph.begin()
    nodes = []
    for i in range(100000):
        nodes.append(Node('test', id=i))
    s = Subgraph(nodes=nodes)
    tx.create(s)
    tx.commit()
    print(time.time() - start)
    print(count_label('test'))
    delete_label('test')



def bench_create4():
    print('Using unwind')
    delete_label('test')
    start = time.time()
    tx = graph.begin()
    ids = list(range(100000))
    tx.run('unwind $ids as id create (n:test {id: id})', ids=ids)
    tx.commit()
    print(time.time() - start)
    print(count_label('test'))
    delete_label('test')


def bench_create():
    create_tests = [bench_create1, bench_create2, bench_create3, bench_create4]

    print('testing create')
    for i in create_tests:
        i()


if __name__ == '__main__':
    bench_create()

Apparently, using Cypher with the unwind keyword is the fastest way to batch insert nodes.

testing create
Using py2neo one by one
96.09799289703369
100000
Using cypher one by one
9.493892192840576
100000
Using Subgraph
7.638832092285156
100000
Using unwind
2.511630058288574
100000

The above result is based on the HTTP protocol. Interestingly, the Bolt protocol decreases the time of the first method but doubles the time of the second. That's weird. Maybe Py2neo has some special optimization for batch inserts over Bolt? But I have no idea why inserting one by one with Cypher is 2x slower. Here is the result with the Bolt protocol.

testing create
Using py2neo one by one
51.73185706138611
100000
Using cypher one by one
22.051995992660522
100000
Using Subgraph
8.81674599647522
100000
Using unwind
2.8623900413513184
100000

Deploy Nikola Org Mode on Travis

Recently, I've been enjoying Spacemacs, so I decided to switch from Markdown to org files for writing my blog. After several attempts, I managed to get Travis to convert org files to HTML. Here are the steps.

Install Org Mode plugin

First, install the Org Mode plugin on your computer following the official guide: Nikola orgmode plugin.

Edit `conf.el`

Org files are converted to HTML for display in Nikola, and the Org Mode plugin calls Emacs to do this job. When I ran nikola build, it showed this message: Please install htmlize from https://github.com/hniksic/emacs-htmlize. I'm using Spacemacs, where the htmlize package is already downloaded if the org layer is enabled, so I just need to add the htmlize folder to load-path. Here is the code:

(setq dir "~/.emacs.d/elpa/27.0/develop/")
(if(file-directory-p dir)
    (let ((default-directory dir))
      (normal-top-level-add-subdirs-to-load-path)))
(require 'htmlize)

This package is also needed on Travis, where a similar approach is required.

Modify `.travis.yml`

Travis uses Ubuntu 14.04, whose default Emacs version is 24 with an Org Mode version below 8.0, which does not meet the requirements. The easiest solution is to update Emacs to 25. So in the before_install section, add this code:

- sudo add-apt-repository ppa:kelleyk/emacs -y
- sudo apt-get update

In the install section, add this code:

- sudo apt-get remove emacs
- sudo apt autoremove
- sudo apt-get install emacs25

The default Emacs doesn't include the htmlize package, so add git clone https://github.com/hniksic/emacs-htmlize ~/emacs-htmlize to the before_install section.

Finally, modify conf.el for the Emacs on Travis by adding the cloned repo to load-path: (add-to-list 'load-path "~/emacs-htmlize/")

Voila, the org file should show up.

The full .travis.yml is below:

language: python
cache: apt
sudo: false
addons:
  apt:
    packages:
    - language-pack-en-base
branches:
  only:
  - src
python:
- 3.6
before_install:
- sudo add-apt-repository ppa:kelleyk/emacs -y
- sudo apt-get update
- openssl aes-256-cbc -K $encrypted_a5c638e4bedc_key -iv $encrypted_a5c638e4bedc_iv
  -in travis.enc -out travis -d
- git config --global user.name 'bebound'
- git config --global user.email 'bebound@gmail.com'
- git config --global push.default 'simple'
- pip install --upgrade pip wheel
- echo -e 'Host github.com\n    StrictHostKeyChecking no' >> ~/.ssh/config
- eval "$(ssh-agent -s)"
- chmod 600 travis
- ssh-add travis
- git remote rm origin
- git remote add origin git@github.com:bebound/bebound.github.io
- git fetch origin master
- git branch master FETCH_HEAD
- git clone https://github.com/hniksic/emacs-htmlize ~/emacs-htmlize
install:
- pip install 'Nikola[extras]'==7.8.15
- sudo apt-get remove emacs
- sudo apt autoremove
- sudo apt-get install emacs25
script:
- nikola build && nikola github_deploy -m 'Nikola auto deploy [ci skip]'
notifications:
  email:
    on_success: change
    on_failure: always

And here is the conf.el:

(setq dir "~/.emacs.d/elpa/27.0/develop/")
(if(file-directory-p dir)
    (let ((default-directory dir))
      (normal-top-level-add-subdirs-to-load-path)))
(add-to-list 'load-path "~/emacs-htmlize/")
(require 'htmlize)

Enable C Extension for gensim on Windows

These days, I'm working on some text classification tasks, and I use gensim's doc2vec function.

When using gensim, it shows this warning message:

C extension not loaded for Word2Vec, training will be slow.

I searched for this on the Internet and found that gensim has rewritten part of its code in Cython rather than NumPy to get better performance. A compiler is required to enable this feature.

I tried installing MinGW and adding it to the path, but that didn't work.

Finally, I installed Visual C++ Build Tools and it worked.

If the output of the following code is not -1, then it's fine.

from gensim.models import word2vec
print(word2vec.FAST_VERSION)

Using Chinese Characters in Matplotlib

After searching on Google, here is the easiest solution. This should also work for other languages:

import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import matplotlib.font_manager as fm
f = "/System/Library/Fonts/PingFang.ttc"
prop = fm.FontProperties(fname=f)

plt.title("你好",fontproperties=prop)
plt.show()

Output: (figure omitted; a plot whose title is rendered in Chinese characters)

Program Crash Caused by CPU Instruction

Dealing with bugs is inevitable in a coding career. The main parts of coding are implementing new features, fixing bugs and improving performance. For me, there are two kinds of bugs that are difficult to tackle: those that are hard to reproduce, and those that occur in code not written by you.

Recently, I met a bug that has both features. I wrote a Spark program to analyse logs and cluster them. Last week I updated the code to use Facebook's faiss library to accelerate the search for similar vectors. After I pushed the new code to Spark, the program crashed. I found this log on the Spark driver:

java.io.EOFException
ERROR PythonRunner: Python worker exited unexpectedly (crashed).

Because the Python worker is created by the Spark JVM, I can't get the internal state of the Python worker. By inserting log statements into the code, I found the rough position of the crash. But the code looked fine.

I had tested the code in my development environment. My development machine uses Spark 2.4, but the Spark platform uses Spark 3.0. I guessed there might be some compatibility problem with Spark 3.0, so I used the same Docker image as the Spark platform to run the code. The code worked as expected without crashing. That's weird: Docker isolates the environment, so how could the same Docker image produce different results?

I searched for the error on Google, and some said it's because Spark is running out of memory. This didn't seem right, as this update shouldn't increase RAM usage. I still gave it a try, with no luck.

Alright, this update added faiss to the code, so maybe faiss led to the crash, as Python didn't raise any other error. If the crash is caused by the native code in faiss, this makes sense. First, I wrote a program with Spark and faiss, and it crashed. Then I wrote a program that only uses faiss, and it still crashed. So I could confirm that the crash is caused by faiss and Spark is innocent. Even stranger, when running on the Spark platform, sometimes the script crashes and sometimes not.

But why does faiss only crash on the Spark platform? I asked a colleague for the details of the failed job and learned that the Docker container's exit code was 132, which means illegal instruction. I searched for illegal instruction in faiss's GitHub issues and found this issue: Illegal instruction (core dumped).

Comparing the CPU instruction sets of the host servers shows that the crashing ones lack the AVX2 instructions. AVX2 was introduced with Intel's fourth-generation Core (Haswell). The development server uses a sixth-generation CPU, while some platform servers are too old to support this instruction set. After adding a parameter that forces the script to be scheduled on the newer servers, the crash disappeared.
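A quick way to check a Linux host for this is to look at the flags line of /proc/cpuinfo. The sketch below is not what I ran at the time, and the helper names cpu_flags and supports_avx2 are made up for illustration; it also shows where exit code 132 comes from.

```python
import signal


def cpu_flags(cpuinfo_text: str) -> set:
    """Collect the CPU feature flags from the text of /proc/cpuinfo (Linux)."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def supports_avx2(cpuinfo_path: str = "/proc/cpuinfo") -> bool:
    """Return True if the host CPU advertises AVX2 (Linux only)."""
    with open(cpuinfo_path) as f:
        return "avx2" in cpu_flags(f.read())


# Exit code 132 is 128 plus the signal number of SIGILL (illegal instruction).
print(128 + signal.SIGILL)  # 132
```

Running supports_avx2() inside the container on each kind of host would have pointed at the older servers right away.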

PS: Running the faiss code index.add(xx) does not trigger the crash, but calling index.search(xx) does. When I was trying to locate the code that caused the crash, the faiss package was imported correctly and the index was built normally. This misled me into believing that the faiss code was working.

Difference between Value and Pointer variable in Defer in Go

defer is a useful statement for cleanup, as deferred calls execute in LIFO order before the surrounding function returns. If you don't know how it works, the execution result may sometimes confuse you.

How it Works and Why Value or Pointer Receiver Matters

I found an interesting piece of code on Stack Overflow:

package main

import "fmt"

type X struct {
    S string
}

func (x X) Close() {
    fmt.Println("Value-Closing", x.S)
}

func (x *X) CloseP() {
    fmt.Println("Pointer-Closing", x.S)
}

func main() {
    x := X{"Value-X First"}
    defer x.Close()
    x = X{"Value-X Second"}
    defer x.Close()

    x2 := X{"Value-X2 First"}
    defer x2.CloseP()
    x2 = X{"Value-X2 Second"}
    defer x2.CloseP()

    xp := &X{"Pointer-X First"}
    defer xp.Close()
    xp = &X{"Pointer-X Second"}
    defer xp.Close()

    xp2 := &X{"Pointer-X2 First"}
    defer xp2.CloseP()
    xp2 = &X{"Pointer-X2 Second"}
    defer xp2.CloseP()
}

The output is:

Pointer-Closing Pointer-X2 Second
Pointer-Closing Pointer-X2 First
Value-Closing Pointer-X Second
Value-Closing Pointer-X First
Pointer-Closing Value-X2 Second
Pointer-Closing Value-X2 Second
Value-Closing Value-X Second
Value-Closing Value-X First

Take a look at lines 5-6 of the output: why was Pointer-Closing Value-X2 Second printed twice? According to Effective Go, ”The arguments to the deferred function (which include the receiver if the function is a method) are evaluated when the defer executes, not when the call executes.” The evaluated arguments are saved anew each time a defer statement executes.

As x2 is a value and the receiver of the deferred method CloseP is a pointer, when defer executes it creates a pointer to x2 as the receiver. The following defer creates a pointer to x2 again. Although x2.S changes to “Second”, x2's address never changes. So when the two deferred calls finally run, both read the same x2 and the same line is printed twice. By contrast, with the value receiver Close, each defer copies the current value of x, so both “First” and “Second” are preserved.

How to Exit Program and Run all Defer

From the Go runtime documentation:

runtime.Goexit() terminates the goroutine that calls it. No other goroutine is affected. Goexit runs all deferred calls before terminating the goroutine. Because Goexit is not a panic, any recover calls in those deferred functions will return nil.

Calling Goexit from the main goroutine terminates that goroutine without func main returning. Since func main has not returned, the program continues execution of other goroutines. If all other goroutines exit, the program crashes.

If you want the program to exit normally, just add defer os.Exit(0) at the top of the main function. Here is the example code:

package main

import (
	"fmt"
	"os"
	"runtime"
	"time"
)

func subGoroutine() {
	defer fmt.Println("exit sub routine")
	for {
		fmt.Println("sub goroutine running")
		time.Sleep(1 * time.Second)
	}
}

func main() {
	defer os.Exit(0)
	defer fmt.Println("calling os.Exit")

	go subGoroutine()

	time.Sleep(2 * time.Second)
	runtime.Goexit()
}

Output:

sub goroutine running
sub goroutine running
sub goroutine running
calling os.Exit

Process finished with exit code 0

The deferred calls in the main goroutine are executed, but the one in subGoroutine is not, because os.Exit behaves as follows:

Exit causes the current program to exit with the given status code. Conventionally, code zero indicates success, non-zero an error. The program terminates immediately; deferred functions are not run.

from godoc


Python

CSRF in Django

CSRF (Cross-site request forgery) is a way to forge a user's request to a target website. For example, a malicious website A has a button that, when clicked, sends a request to www.B.com/logout. When the user clicks this button, they are logged out of website B without knowing it. Logging out is not a big problem, but a malicious website can generate more dangerous requests, such as a money transfer.

Django CSRF protection

Each web framework takes a different approach to CSRF protection. In Django, the validation process is as follows:

When the user logs in for the first time, Django generates a csrf_secret, adds a random salt and encrypts it as A, then saves A to the csrftoken cookie.
When Django processes the tag {{ csrf_token }} or {% csrf_token %}, it reads A from the csrftoken cookie, reverses it to csrf_secret, adds a new random salt and encrypts it as B, then returns the corresponding HTML.
When Django receives a POST request, it retrieves the csrftoken cookie as A and tries to get the csrfmiddlewaretoken value B from the POST data. If that does not exist, it uses the value of the X-CSRFToken header as B. Then A and B are both reversed to csrf_secret. If the two values are identical, the validation passes. Otherwise, a 403 error is raised.
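To make the "salt and reverse" idea concrete, here is a simplified Python sketch in the spirit of the steps above. It is not Django's actual code: the names mask, unmask and tokens_match, the alphabet and the secret length are all assumptions for illustration.

```python
import secrets
import string

CHARS = string.ascii_letters + string.digits
SECRET_LENGTH = 32


def random_string(length: int = SECRET_LENGTH) -> str:
    return "".join(secrets.choice(CHARS) for _ in range(length))


def mask(secret: str) -> str:
    """Return salt + (secret shifted by salt), so every token looks different."""
    salt = random_string(len(secret))
    shifted = "".join(
        CHARS[(CHARS.index(s) + CHARS.index(m)) % len(CHARS)]
        for s, m in zip(secret, salt)
    )
    return salt + shifted


def unmask(token: str) -> str:
    """Recover the secret from a token produced by mask()."""
    half = len(token) // 2
    salt, shifted = token[:half], token[half:]
    return "".join(
        CHARS[(CHARS.index(c) - CHARS.index(m)) % len(CHARS)]
        for c, m in zip(shifted, salt)
    )


def tokens_match(cookie_token: str, request_token: str) -> bool:
    # Compare the underlying secrets in constant time.
    return secrets.compare_digest(unmask(cookie_token), unmask(request_token))


secret = random_string()
cookie_a = mask(secret)   # value A, stored in the csrftoken cookie
form_b = mask(secret)     # value B, rendered into the form
print(cookie_a != form_b, tokens_match(cookie_a, form_b))
```

A and B are different strings because each uses its own salt, yet both reverse to the same secret, which is what the validation step compares.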

Django CSRF Usage

Form

<form>
    {% csrf_token %}
</form>

Single AJAX request

$.ajax({
    // url, type and other settings go here
    data: {
        csrfmiddlewaretoken: '{{ csrf_token }}'
    },
});

Multiple AJAX requests

Extract csrftoken from the cookie and add it to the header of each AJAX request.

function getCookie(name) {
    var cookieValue = null;
    if (document.cookie && document.cookie !== '') {
        var cookies = document.cookie.split(';');
        for (var i = 0; i < cookies.length; i++) {
            var cookie = jQuery.trim(cookies[i]);
            // Does this cookie string begin with the name we want?
            if (cookie.substring(0, name.length + 1) === (name + '=')) {
                cookieValue = decodeURIComponent(cookie.substring(name.length + 1));
                break;
            }
        }
    }
    return cookieValue;
}
var csrftoken = getCookie('csrftoken');

function csrfSafeMethod(method) {
    // these HTTP methods do not require CSRF protection
    return (/^(GET|HEAD|OPTIONS|TRACE)$/.test(method));
}
$.ajaxSetup({
    beforeSend: function(xhr, settings) {
        if (!csrfSafeMethod(settings.type) && !this.crossDomain) {
            xhr.setRequestHeader("X-CSRFToken", csrftoken);
        }
    }
});


Python Dictionary Implementation

Overview

CPython allocates memory to store the dictionary. The initial table size is 8, and entries are saved as <hash,key,value> in each slot (the slot content changed in Python 3.6).
When a new key is added, Python uses i = hash(key) & mask, where mask = table_size - 1, to calculate which slot it should be placed in. If the slot is occupied, CPython uses a probing algorithm to find an empty slot to store the new item.
When 2/3 of the table is full, the table is resized.
When getting an item from the dictionary, both the hash and the key must be equal.

Resizing

When the number of elements is below 50000, the table grows to the smallest power of two larger than 4 times the number of used slots. Otherwise, the factor is 2. The table size is always $2^{n}$.

| dict size | resize when elements in dict | new table size |
| 8         | 6                            | 32             |
| 32        | 22                           | 128            |
| 128       | 86                           | 512            |
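The numbers in the table can be reproduced with a small model of the rule. This is only a sketch of the behaviour described in this post (2/3 load limit, growth factor 4 or 2), with function names of my own choosing; it is not copied from CPython and newer versions may behave differently.

```python
def resize_threshold(table_size: int) -> int:
    """Smallest number of entries that fills at least 2/3 of the table."""
    return -(-2 * table_size // 3)  # ceiling of 2 * table_size / 3


def new_table_size(used: int, min_size: int = 8) -> int:
    """Smallest power of two above used * factor (4 up to 50000 entries, else 2)."""
    factor = 4 if used <= 50000 else 2
    size = min_size
    while size <= used * factor:
        size *= 2
    return size


size = 8
for _ in range(3):
    used = resize_threshold(size)
    print(size, used, new_table_size(used))
    size = new_table_size(used)
# 8 6 32
# 32 22 128
# 128 86 512
```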

Removing an item from a dictionary doesn't shrink the table. The slot of the removed item is marked as null (a dummy entry) rather than as empty. When looking up an element, probing continues past this special mark. So deleting elements from a dictionary does not decrease its memory usage. If you really want to reclaim the memory, you can copy the items of the old dictionary into a new one.

Probing

CPython uses a modified random probing algorithm to choose an empty slot. This algorithm can traverse all of the slots in a pseudo-random order.

The traversal order can be calculated by this formula: j = ((5*j) + 1) mod 2**i, where j is the slot index and 2**i is the table size.

For example, if the table size is 8 and the calculated slot index is 2, then the traversal order is:

2 -> (5*2+1) mod 8 = 3 -> (5*3+1) mod 8 = 0 -> (5*0+1) mod 8 = 1 -> 6 -> 7 -> 4 -> 5 -> 2

CPython changes this formula by adding the perturb and PERTURB_SHIFT variables, where perturb is initialized to the hash value and PERTURB_SHIFT is 5. By adding perturb, the probe sequence depends on every bit of the hash code, and the collision probability is decreased. And perturb eventually becomes 0, which ensures that all of the slots will be checked.

j = (5*j) + 1 + perturb;
perturb >>= PERTURB_SHIFT;
j = j % 2**i
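Both probe sequences are easy to play with in Python. The generator names below are hypothetical, and the perturbed version assumes a non-negative hash value to keep the shift simple.

```python
from itertools import islice

PERTURB_SHIFT = 5


def simple_probes(start: int, table_size: int):
    """Yield slots using j = (5*j + 1) mod table_size, one full cycle."""
    j = start
    for _ in range(table_size):
        yield j
        j = (5 * j + 1) % table_size


def perturbed_probes(hash_value: int, table_size: int):
    """Yield slots endlessly, mixing the high bits of the hash into early probes."""
    perturb = hash_value  # assumed non-negative in this sketch
    j = hash_value % table_size
    while True:
        yield j
        j = (5 * j + 1 + perturb) % table_size
        perturb >>= PERTURB_SHIFT


print(list(simple_probes(2, 8)))  # [2, 3, 0, 1, 6, 7, 4, 5]
print(list(islice(perturbed_probes(12345, 8), 12)))
```

The first line matches the traversal order worked out by hand above. In the second sequence the early probes depend on the high bits of the hash, and once perturb reaches 0 it falls back to the plain recurrence and visits every slot.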

Dictionary improvement after 3.6

CPython 3.6 uses a compact representation to store entries, and “The memory usage of the new dict() is between 20% and 25% smaller compared to Python 3.5”.

Compact Hash Table

As mentioned before, entries are saved in the form <hash,key,value>. Each entry takes 3 × 8 = 24 bytes on a 64-bit machine. And no matter how many items are added to the dictionary, the memory usage is the same (24 bytes × table_size).

Since 3.6, CPython uses two structures to store the data. One is the index, the other is the real data.

For example, if the table size is 8 and there is an item in slot 1, the index looks like this:

[null, 0, null, null, null, null, null, null]

And the real data is:

| hash | key  | value |
| xxx1 | yyy1 | zzz1  |

0 represents the item's index in the real data. If another item is added in slot 3, the index becomes:

[null, 0, null, 1, null, null, null, null]

The real data becomes:

| hash | key  | value |
| xxx1 | yyy1 | zzz1  |
| xxx2 | yyy2 | zzz2  |

This saves memory, especially when table load factor is low.
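The two structures can be mimicked with a toy class. This is a rough sketch only: CompactDict is a made-up name, the table has a fixed size, and deletion and resizing are left out.

```python
class CompactDict:
    """Fixed-size toy table: a sparse index array plus a dense entry list."""

    def __init__(self, table_size: int = 8):
        self.indices = [None] * table_size   # slot -> position in entries
        self.entries = []                    # (hash, key, value) in insertion order

    def _slot_for(self, key):
        """Return the slot holding `key`, or the first empty slot on its probe path."""
        size = len(self.indices)
        j = hash(key) % size
        for _ in range(size):
            pos = self.indices[j]
            if pos is None or self.entries[pos][1] == key:
                return j
            j = (5 * j + 1) % size
        raise RuntimeError("table is full (this sketch never resizes)")

    def __setitem__(self, key, value):
        slot = self._slot_for(key)
        entry = (hash(key), key, value)
        if self.indices[slot] is None:
            self.indices[slot] = len(self.entries)
            self.entries.append(entry)
        else:
            self.entries[self.indices[slot]] = entry

    def __getitem__(self, key):
        pos = self.indices[self._slot_for(key)]
        if pos is None:
            raise KeyError(key)
        return self.entries[pos][2]

    def keys(self):
        return [key for _, key, _ in self.entries]


d = CompactDict()
d[1] = "first"    # hash(1) == 1, so this lands in slot 1
d[3] = "second"   # hash(3) == 3, so this lands in slot 3
print(d.indices)  # [None, 0, None, 1, None, None, None, None]
print(d.keys())   # [1, 3]
```

Only the small index array is sparse; the entries list grows one row per item, which is where the memory saving comes from.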

Order preserving

Since entries are appended to the real data in insertion order and the index table only points into it, the order of entries is preserved. This feature has been part of the language spec since Python 3.7.


Circular Import in Python

Recently, I found a really good code example of circular import in Python, and I'd like to record it here.

Here is the code:

# X.py
def X1():
    return "x1"

from Y import Y2

def X2():
    return "x2"

# Y.py
def Y1():
    return "y1"

from X import X1

def Y2():
    return "y2"

Guess what happens if you run python X.py and python Y.py?

Here is the answer. The first one outputs this:

Traceback (most recent call last):
  File "X.py", line 4, in <module>
    from Y import Y2
  File "/Users/kk/Y.py", line 4, in <module>
    from X import X1
  File "/Users/kk/X.py", line 4, in <module>
    from Y import Y2
ImportError: cannot import name Y2

The second one runs normally.

If this is what you expected, you already know how Python import works and don't need to read this post.

Python import machinery

When Python imports a module for the first time, it creates a new module object and sets sys.modules[module_name] = module object, then executes the module's code in that object to define its content. If you import that module again, Python just returns the object saved in sys.modules.

At X.py line 4, Python adds Y to sys.modules and starts executing the code in Y.py. At Y.py line 4, it pauses the import of Y, adds X to sys.modules, and executes the code in X.py. Back at X.py line 4, Python finds Y in sys.modules and tries to import Y2 from Y. But Y2 is not yet defined, so ImportError is raised. Running python Y.py works because the nested import only asks X for X1, which is already defined before X.py reaches its own import statement.

How to fix

Change the import order.
Wrap the calls that depend on the other module into a configure function and call it manually.
Use a dynamic import (import within a function).
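The last option can be tried without touching real files by writing the two modules to a temporary folder. This is a sketch of my own: the change to X2 (importing Y2 inside the function and using it) and the helper load_x are assumptions for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

X_SOURCE = '''
def X1():
    return "x1"

def X2():
    from Y import Y2  # resolved only when X2() is called
    return "x2 " + Y2()
'''

Y_SOURCE = '''
def Y1():
    return "y1"

from X import X1

def Y2():
    return "y2"
'''


def load_x():
    """Write both modules to a temporary folder and import X from it."""
    folder = tempfile.mkdtemp()
    Path(folder, "X.py").write_text(X_SOURCE)
    Path(folder, "Y.py").write_text(Y_SOURCE)
    sys.path.insert(0, folder)
    importlib.invalidate_caches()
    return importlib.import_module("X")


X = load_x()
print(X.X2())  # x2 y2
```

Because X no longer imports Y at module level, X is fully defined by the time Y asks it for X1, so the cycle never sees a half-initialized module.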


Torchtext snippets

Load separate files

The data.Field parameters are documented here.

When calling build_vocab, torchtext adds <unk> to the vocabulary list. Set unk_token=None if you want to remove it. If sequential=True (the default), it also adds <pad> to the vocab. By default, <unk> and <pad> are added at the beginning of the vocabulary list.

LabelField is similar to Field, but it sets sequential=False, unk_token=None and is_target=True.

from torchtext import data

INPUT = data.Field(lower=True, batch_first=True)
TAG = data.LabelField()

train, val, test = data.TabularDataset.splits(path=base_dir.as_posix(), train='train_data.csv',
                                                validation='val_data.csv', test='test_data.csv',
                                                format='tsv',
                                                fields=[(None, None), ('input', INPUT), ('tag', TAG)])

Load single file

all_data = data.TabularDataset(path=base_dir / 'gossip_train_data.csv',
                               format='tsv',
                               fields=[('text', TEXT), ('category', CATEGORY)])
train, val, test = all_data.split([0.7, 0.2, 0.1])

Create iterator

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_sizes=(32, 256, 256), shuffle=True,
    sort_key=lambda x: x.input)

Load pretrained vector

from torchtext.vocab import Vectors

vectors = Vectors(name='cc.zh.300.vec', cache='./')

INPUT.build_vocab(train, vectors=vectors)
TAG.build_vocab(train, val, test)

Check vocab sizes

You can view the vocab index via vocab.itos.

tag_size = len(TAG.vocab)

Use field vector in model

import torch.nn as nn

vec = INPUT.vocab.vectors

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(vec, freeze=False)

Convert text to vector

s = ' '.join(segmentize(s))
s = INPUT.preprocess(s)
vec = INPUT.process([s])

C3 Linearization and Python MRO(Method Resolution Order)

Python supports multiple inheritance: a class can be derived from more than one base class. If the specified attribute or method is not found in the current class, how is the search order among the superclasses decided? In simple scenarios, we know the rule is left to right, bottom to top. But when the inheritance hierarchy becomes complicated, it's not easy to answer by intuition.

For instance, what's the search order of class M?

class X: pass
class Y: pass
class Z: pass
class A(X, Y): pass
class B(Y, Z): pass
class M(B, A, Z): pass

The answer is: M, B, A, X, Y, Z, object

C3 Algorithm

How does Python generate this sequence? Since Python 2.3, it uses the C3 Linearization algorithm.

C3 follows these two equations:

L[object] = [object]
L[C(B1…BN)] = [C] + merge(L[B1]…L[BN], [B1, … ,BN])

L[C] is the MRO of class C; it evaluates to a list.

The key process is merge. It takes several lists and generates one list this way:

First, take the head element of the first list (L[B1]) as H.
If H is not in the tail of any other list, output it, remove it from all of the lists, and go to step 1. Otherwise, take the head of the next list as H and repeat step 2. (The tail is the rest of a list except its first element.)
If all of merge's lists are empty, the algorithm ends. If the lists are not empty but no element can be output, raise an error.

That seems complicated, so I'll use the previous example again to explain the calculation of C3.

Let's begin with the easy ones. First, calculate A's MRO:

L[A(X,Y)]=[A]+merge(L[X],L[Y],[X,Y])
         =[A]+merge([X,obj],[Y,obj],[X,Y])
         # X is not in the tail of any other list, use it as H
         =[A,X]+merge([obj],[Y,obj],[Y])
         # obj is in the tail of [Y,obj], use Y as H
         =[A,X,Y]+merge([obj],[obj])
         =[A,X,Y,obj]

B's MRO [B,Y,Z,obj] and Z's MRO [Z,obj] can be calculated in the same way.

Now we can get M’s MRO:

L[M(B,A,Z)]=[M]+merge(L[B],L[A],L[Z],[B,A,Z])
         =[M]+merge([B,Y,Z,obj],[A,X,Y,obj],[Z,obj],[B,A,Z])
         =[M,B]+merge([Y,Z,obj],[A,X,Y,obj],[Z,obj],[A,Z])
         # Y is in the tail of [A,X,Y,obj], use A as H
         =[M,B,A]+merge([Y,Z,obj],[X,Y,obj],[Z,obj],[Z])
         # Y is in the tail of [X,Y,obj], use X as H
         =[M,B,A,X]+merge([Y,Z,obj],[Y,obj],[Z,obj],[Z])
         =[M,B,A,X,Y]+merge([Z,obj],[obj],[Z,obj],[Z])
         =[M,B,A,X,Y,Z]+merge([obj],[obj],[obj])
         =[M,B,A,X,Y,Z,obj]
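The merge steps translate almost directly into Python. This is a small sketch for checking the hand calculation, not CPython's implementation; the function names merge and c3_mro are my own.

```python
def merge(sequences):
    """C3 merge: repeatedly take the first head that is in no other list's tail."""
    sequences = [list(seq) for seq in sequences if seq]
    result = []
    while sequences:
        for seq in sequences:
            head = seq[0]
            if not any(head in other[1:] for other in sequences):
                break
        else:
            raise TypeError("inconsistent hierarchy, no valid MRO")
        result.append(head)
        # Remove the chosen head everywhere and drop exhausted lists.
        sequences = [[c for c in seq if c is not head] for seq in sequences]
        sequences = [seq for seq in sequences if seq]
    return result


def c3_mro(cls):
    """L[C(B1..BN)] = [C] + merge(L[B1], ..., L[BN], [B1, ..., BN])."""
    if cls is object:
        return [object]
    bases = list(cls.__bases__)
    return [cls] + merge([c3_mro(base) for base in bases] + [bases])


class X: pass
class Y: pass
class Z: pass
class A(X, Y): pass
class B(Y, Z): pass
class M(B, A, Z): pass

print([c.__name__ for c in c3_mro(M)])
# ['M', 'B', 'A', 'X', 'Y', 'Z', 'object']
```

The result can be compared with what the interpreter itself computes, for example via M.__mro__.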

MRO and super()

super also uses the C3 order to find the inherited method to execute.

For instance, C's MRO is C, A, B, Base, obj, so after enter A it outputs enter B rather than enter base.

class Base:
    def __init__(self):
        print('enter base')
        print('leave base')


class A(Base):
    def __init__(self):
        print('enter A')
        super(A, self).__init__()
        print('leave A')


class B(Base):
    def __init__(self):
        print('enter B')
        super(B, self).__init__()
        print('leave B')


class C(A, B):
    def __init__(self):
        print('enter C')
        super(C, self).__init__()
        print('leave C')

c = C()

enter C
enter A
enter B
enter base
leave base
leave B
leave A
leave C

super works roughly like this: it gets inst's MRO, finds cls's index, and returns the next class in the MRO. (In Python 3, super(A, self) can be written as super().)

def super(cls, inst):
    mro = inst.__class__.mro()
    return mro[mro.index(cls) + 1]

When running the line super(C, self).__init__(), self is C's instance, and mro is:

[<class '__main__.C'>, <class '__main__.A'>, <class '__main__.B'>, <class '__main__.Base'>, <class 'object'>]

So it returns A, and A executes __init__(), which calls super(A, self).__init__() and enters B's __init__(). (C's instance inst is passed as self along the call chain.)


Import custom package or module in PySpark

First, zip all of the dependencies into a zip file like this. Then you can use one of the following methods to import it.

|-- kk.zip
|   |-- kk.py

Using --py-files in spark-submit

When submitting the Spark job, add the --py-files=kk.zip parameter. kk.zip will be distributed with the main script file and placed on the PYTHONPATH.

Then you can use import kk in your main script file.

This utilizes Python's zip import feature. For more information, check this link: zipimport
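The zip import feature itself can be tried locally without Spark. The sketch below builds a tiny kk.zip in a temporary folder; the hello function inside kk.py is a made-up example.

```python
import importlib
import sys
import tempfile
import zipfile
from pathlib import Path


def build_zip(folder: str) -> str:
    """Create kk.zip containing a one-function module kk.py."""
    zip_path = Path(folder, "kk.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("kk.py", "def hello():\n    return 'hello from kk'\n")
    return str(zip_path)


zip_path = build_zip(tempfile.mkdtemp())
sys.path.insert(0, zip_path)      # what putting the zip on PYTHONPATH amounts to
importlib.invalidate_caches()

kk = importlib.import_module("kk")
print(kk.hello())                 # hello from kk
```

Python treats the zip archive on the search path like a folder, so import kk is served straight from the archive.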

Using addPyFile in main script

You can also upload the zip file to HDFS and call sc.addPyFile('hdfs://kk.zip') after the SparkContext is initialized.

This has the same effect as --py-files, but your import statement must come after this line.

Using cibuildwheel to Create Python Wheels

Have you ever tried to install MySQL-python? It contains C code that needs to be compiled while installing the package. You have to follow the steps in this article: Install MySQL and MySQLClient(Python) in MacOS. Things get worse if you are using Windows.

Luckily, a new distribution format, Wheel, has been published in PEP 427.

The wheel binary package format frees installers from having to know about the build system, saves time by amortizing compile time over many installations, and removes the need to install a build system in the target environment.

Installing wheels does not require a compiler on the system and is much faster.

Cibuildwheel is a very useful tool for building wheels. It can run on many CI servers (GitHub Actions, Travis, Azure Pipelines, etc.) and build wheels across many platforms.

Usage

You need to create a configuration file for the CI server; you can read the examples and documents.

For example, GitHub Actions can use this configuration file:

name: Build

on: [push, pull_request]

jobs:
  build_wheels:
    name: Build wheels on ${{ matrix.os }}
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-18.04, windows-latest, macos-latest]

    steps:
      - uses: actions/checkout@v2

      - uses: actions/setup-python@v2
        name: Install Python
        with:
          python-version: '3.7'

      - name: Install cibuildwheel
        run: |
          python -m pip install cibuildwheel==1.5.5
      - name: Install Visual C++ for Python 2.7
        if: runner.os == 'Windows'
        run: |
          choco install vcpython27 -f -y
      - name: Build wheels
        run: |
          python -m cibuildwheel --output-dir wheelhouse
      - uses: actions/upload-artifact@v2
        with:
          path: ./wheelhouse/*.whl

Useful Options

These options can be applied by setting environment variables.

CIBW_BUILD / CIBW_SKIP

Use these options to filter the Python versions to build.

Example:

# Only build on Python 3.6
CIBW_BUILD: cp36-*

# Skip building on Python 2.7 on the Mac
CIBW_SKIP: cp27-macosx_x86_64

# Skip building on Python 3.8 on the Mac
CIBW_SKIP: cp38-macosx_x86_64

CIBW_BEFORE_BUILD

Execute a shell command before building the wheel.

Upload to PyPI

Now you can download wheelhouse.zip from the Actions panel on GitHub. To publish manually, start from a clean dist folder (rm -rf dist), unzip the wheels into it, then run python setup.py sdist && twine upload dist/*. You can find a more detailed guide in this article: Packaging Python Projects.

This process can also be automated in the CI configuration file. You can find example configuration files in the official repo.


Using JSONField before Django 3.1

Since Django 3.1, Django supports saving Python data to the database as JSON-encoded data, and it is also possible to make queries based on field values in a JSONField. The detailed usage can be found here. If you are using an older version and want to try this feature, there are many packages that port this function; I recommend django-jsonfield-backport.

django-jsonfield-backport

This package saves data as JSON in the database and also supports JSON queries. If your database meets the requirements (MySQL > 5.7, PG > 9.5, MariaDB > 10.2 or SQLite > 3.9 with the JSON1 extension), you can use JSONField like Django's native implementation.

from django.db import models
from django_jsonfield_backport.models import JSONField

class ContactInfo(models.Model):
    data = JSONField()

ContactInfo.objects.create(data={
    'name': 'John',
    'cities': ['London', 'Cambridge'],
    'pets': {'dogs': ['Rufus', 'Meg']},
})
ContactInfo.objects.filter(
    data__name='John',
    data__pets__has_key='dogs',
    data__cities__contains='London',
).delete()

jsonfield

jsonfield is another popular package for using JSONField. It saves data as Text in the database, but you can manipulate the field value as Python data. Unlike django-jsonfield-backport, it does not provide JSON querying capability.

Django REST framework

As the data is stored as a JSON string in the database, the output is a string rather than an object when Django REST framework serializes a jsonfield.JSONField. If you prefer to get and update the data as an object, you need to manually specify the field as `serializers.JSONField` like this:

from rest_framework import serializers
from .models import Product

class ProductSerializer(serializers.ModelSerializer):
    images = serializers.JSONField()
    class Meta:
        model = Product
        fields = '__all__'

(You do not need to do this when using django-jsonfield-backport, everything just works.)

Ref

How to disable auto strip in Charfield in Django

In Django, when you edit a field in the admin page or post data to a form, the leading and trailing whitespace in CharField and TextField is removed.

The reason is the strip=True parameter of forms.CharField, which was added in Django 1.9. You can see the discussion in Django ticket #4960, and here is the source code. models.CharField and models.TextField use formfield() to create the form field that interacts with the user, and both of them eventually create a forms.CharField.

This only affects values returned from forms; you can still update the model manually and call save() to store a value with surrounding spaces.

Normally, this feature helps keep text fields clean. But sometimes you may want the original value, and here are three different solutions:

Suppose we have this Test model.

# models.py
from django.db import models


class Test(models.Model):
    char = models.CharField(max_length=20)
    text = models.TextField()

Change ModelAdmin

# admin.py
from django.contrib import admin


class TestAdmin(admin.ModelAdmin):
    def formfield_for_dbfield(self, db_field, request, **kwargs):
        # Pass strip=False through to the generated forms.CharField.
        if db_field.name in ['char', 'text']:
            kwargs['strip'] = False
        return super().formfield_for_dbfield(db_field, request, **kwargs)

This approach tackles the problem by overriding the arguments passed to each field’s default formfield() method.

Define Custom Form

# forms.py
from django import forms

from .models import Test


class CustomForm(forms.ModelForm):
    char = forms.CharField(strip=False)
    text = forms.CharField(strip=False, widget=forms.Textarea)

    class Meta:
        model = Test
        exclude = []

# admin.py
class TestAdmin(admin.ModelAdmin):
    form = CustomForm

Now when you edit data in the admin panel, the whitespace is no longer removed.

Use Custom Field

You can also use your custom field in models.py. For example:

# models.py
from django.db import models
from django.db.models import TextField


class NonStrippingTextField(TextField):
    def formfield(self, **kwargs):
        # Ask the generated forms.CharField not to strip whitespace.
        kwargs['strip'] = False
        return super().formfield(**kwargs)


class Test(models.Model):
    text = NonStrippingTextField()


REST Framework

If you use Django REST framework to edit data, you only need to change the serializer.

class TestSerializer(serializers.HyperlinkedModelSerializer):
    class Meta:
        model = Test
        fields = '__all__'
        extra_kwargs = {"char": {"trim_whitespace": False},
                        "text": {"trim_whitespace": False}}

Ref

Memory Leak in Python multiprocessing.Pool

There was a long-standing memory leak in our Django app, and I fixed it recently. As time went by, the app’s memory usage kept growing, and so did the CPU usage.

After some research, I figured out the cause: some views did not close multiprocessing.Pool after using it. The problem disappeared when I used Pool in a with statement.

But I was still curious about it and wrote some test code. The script was run on Python 3.6.8 and produces similar results with multiprocessing.pool.ThreadPool.

import time
from multiprocessing import Pool


def func(i):
    return i


def ori():
    # creates more and more threads as time goes by; when i == 300, CPU grows to 300%, the 16 GB of RAM runs out and the script hangs until I kill the process
    p = Pool(4)
    r.append(p.map(func, range(4)))


def with_close():
    # 100% cpu, 0.1 ram, create 40 thread, takes 41s
    p = Pool(4)
    r.append(p.map(func, range(4)))
    p.close()


def with_terminate():
    # 5% cpu, 0.1 ram, create 4 thread, takes 425s
    p = Pool(4)
    r.append(p.map(func, range(4)))
    p.terminate()


def with_with():
    # same as terminate
    with Pool(4) as p:
        r.append(p.map(func, range(4)))


r = []
s = time.time()
for i in range(4000):
    ori()
    # with_close()
    # with_terminate()
    # with_with()

    if i % 100 == 0:
        print(i)

print(f'takes {time.time() - s} seconds')

As you can see, there are four functions. The ori function uses Pool without close() or terminate(); the RAM usage keeps growing and the script hangs. with_close, with_terminate and with_with exit normally, but their running times differ.

Why `close()` is faster than `terminate()`

Pool.terminate() calls terminate() on each worker. Pool.close() only changes the pool state, and each worker then exits by itself. You can find the source code on GitHub.

Verify Memory Leak

import gc
import time
import weakref


from multiprocessing import Pool

def func(i):
    return i


p = Pool(4)
wr = weakref.ref(p)
p.map(func, range(4))
print(wr())
print(gc.get_referents(wr()))
# p.close()
# p.terminate()
time.sleep(1)
del p
gc.collect()
print(wr())
print(gc.get_referents(wr()))

If close() or terminate() is not called, p is still referenced by some objects after execution:

<multiprocessing.pool.Pool object at 0x7fc0e6db0828>
[{'_ctx': <multiprocessing.context.ForkContext object at 0x7fc0e6d455c0>, '_inqueue': <multiprocessing.queues.SimpleQueue object at 0x7fc0e6db0860>, '_outqueue': <multiprocessing.queues.SimpleQueue object at 0x7fc0e5babac8>, '_quick_put': <bound method _ConnectionBase.send of <multiprocessing.connection.Connection object at 0x7fc0e620e8d0>>, '_quick_get': <bound method _ConnectionBase.recv of <multiprocessing.connection.Connection object at 0x7fc0e4d241d0>>, '_taskqueue': <queue.Queue object at 0x7fc0e4d24320>, '_cache': {}, '_state': 0, '_maxtasksperchild': None, '_initializer': None, '_initargs': (), '_processes': 4, '_pool': [<ForkProcess(ForkPoolWorker-1, started daemon)>, <ForkProcess(ForkPoolWorker-2, started daemon)>, <ForkProcess(ForkPoolWorker-3, started daemon)>, <ForkProcess(ForkPoolWorker-4, started daemon)>], '_worker_handler': <Thread(Thread-1, started daemon 140466410379008)>, '_task_handler': <Thread(Thread-2, started daemon 140466401986304)>, '_result_handler': <Thread(Thread-3, started daemon 140466393593600)>, '_terminate': <Finalize object, callback=_terminate_pool, args=(<queue.Queue object at 0x7fc0e4d24320>, <multiprocessing.queues.SimpleQueue object at 0x7fc0e6db0860>, <multiprocessing.queues.SimpleQueue object at 0x7fc0e5babac8>, [<ForkProcess(ForkPoolWorker-1, started daemon)>, <ForkProcess(ForkPoolWorker-2, started daemon)>, <ForkProcess(ForkPoolWorker-3, started daemon)>, <ForkProcess(ForkPoolWorker-4, started daemon)>], <Thread(Thread-1, started daemon 140466410379008)>, <Thread(Thread-2, started daemon 140466401986304)>, <Thread(Thread-3, started daemon 140466393593600)>, {}), exitprority=15>}, <class 'multiprocessing.pool.Pool'>]
<multiprocessing.pool.Pool object at 0x7fc0e6db0828>
[{'_ctx': <multiprocessing.context.ForkContext object at 0x7fc0e6d455c0>, '_inqueue': <multiprocessing.queues.SimpleQueue object at 0x7fc0e6db0860>, '_outqueue': <multiprocessing.queues.SimpleQueue object at 0x7fc0e5babac8>, '_quick_put': <bound method _ConnectionBase.send of <multiprocessing.connection.Connection object at 0x7fc0e620e8d0>>, '_quick_get': <bound method _ConnectionBase.recv of <multiprocessing.connection.Connection object at 0x7fc0e4d241d0>>, '_taskqueue': <queue.Queue object at 0x7fc0e4d24320>, '_cache': {}, '_state': 0, '_maxtasksperchild': None, '_initializer': None, '_initargs': (), '_processes': 4, '_pool': [<ForkProcess(ForkPoolWorker-1, started daemon)>, <ForkProcess(ForkPoolWorker-2, started daemon)>, <ForkProcess(ForkPoolWorker-3, started daemon)>, <ForkProcess(ForkPoolWorker-4, started daemon)>], '_worker_handler': <Thread(Thread-1, started daemon 140466410379008)>, '_task_handler': <Thread(Thread-2, started daemon 140466401986304)>, '_result_handler': <Thread(Thread-3, started daemon 140466393593600)>, '_terminate': <Finalize object, callback=_terminate_pool, args=(<queue.Queue object at 0x7fc0e4d24320>, <multiprocessing.queues.SimpleQueue object at 0x7fc0e6db0860>, <multiprocessing.queues.SimpleQueue object at 0x7fc0e5babac8>, [<ForkProcess(ForkPoolWorker-1, started daemon)>, <ForkProcess(ForkPoolWorker-2, started daemon)>, <ForkProcess(ForkPoolWorker-3, started daemon)>, <ForkProcess(ForkPoolWorker-4, started daemon)>], <Thread(Thread-1, started daemon 140466410379008)>, <Thread(Thread-2, started daemon 140466401986304)>, <Thread(Thread-3, started daemon 140466393593600)>, {}), exitprority=15>}, <class 'multiprocessing.pool.Pool'>]

After calling close() or terminate(), the last two lines become:

None
[]

Document Update

The Python 3.7 documentation adds this warning:

multiprocessing.pool objects have internal resources that need to be properly managed (like any other resource) by using the pool as a context manager or by calling close() and terminate() manually. Failure to do this can lead to the process hanging on finalization. Note that it is not correct to rely on the garbage collector to destroy the pool as CPython does not assure that the finalizer of the pool will be called (see object.__del__() for more information).
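Here is a minimal sketch of the two patterns recommended above: the context manager, and an explicit shutdown. It uses ThreadPool instead of Pool only so that the snippet runs anywhere without a `__main__` guard. The function names are made up for illustration, and the explicit variant uses close() followed by join(), which is one common way to shut a pool down, not the only one.

```python
from multiprocessing.pool import ThreadPool


def square(i):
    return i * i


def run_with_context_manager(n):
    # Leaving the with block calls terminate() on the pool.
    with ThreadPool(4) as pool:
        return pool.map(square, range(n))


def run_with_close_join(n):
    pool = ThreadPool(4)
    try:
        return pool.map(square, range(n))
    finally:
        # close() stops accepting new tasks; join() waits for the workers to exit.
        pool.close()
        pool.join()
```

Either way the pool is released deterministically instead of being left to the garbage collector.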

The Bug is Fixed in Python 3.8

In Python 3.8.6, the script exits normally without calling close(), and the total execution time also decreases. According to the Python bug tracker, this issue has been fixed: multiprocessing.Pool and ThreadPool leak resources after being deleted.

Python 3.11 changes

In [Packaging] Support Python 3.11 (Pull Request #26923, Azure/azure-cli, https://github.com/Azure/azure-cli/pull/26923), I bumped azure-cli to Python 3.11. We had already bumped the dependencies in other PRs, so I expected this to be a small PR, but in the end it needed a lot of changes.

`inspect.getargspec`

getargspec is dropped in 3.11. You can easily replace it with getfullargspec. It returns FullArgSpec(args, varargs, varkw, defaults, kwonlyargs, kwonlydefaults, annotations) instead of ArgSpec(args, varargs, keywords, defaults), so args, _, kw, _ = inspect.getargspec(fn) can be replaced by args, _, kw, *_ = inspect.getfullargspec(fn). However, getfullargspec is itself a compatibility API:

Note that signature() and Signature Object provide the recommended API for callable introspection, and support additional behaviours (like positional-only arguments) that are sometimes encountered in extension module APIs. This function is retained primarily for use in code that needs to maintain compatibility with the Python 2 inspect module API. –inspect — Inspect live objects — Python 3.11.4 documentation

The modern signature function provides a similar result but needs more modification:

import inspect
def testfunc(a, /, b=1, c=2, *args, kk, **kwargs):
    pass

print(inspect.getfullargspec(testfunc))
print(inspect.signature(testfunc).parameters)

for i, j in inspect.signature(testfunc).parameters.items():
    print(i, type(i), j, type(j), j.kind)

args, _, kw, *_ = inspect.getfullargspec(testfunc)
print(args, kw)

from inspect import Parameter

parameters = inspect.signature(testfunc).parameters
args = [k for k, v in parameters.items() if v.kind in {Parameter.POSITIONAL_OR_KEYWORD, Parameter.POSITIONAL_ONLY}]
kw = next(iter([k for k, v in parameters.items() if v.kind == Parameter.VAR_KEYWORD]), None)
print(args, kw)

FullArgSpec(args=['a', 'b', 'c'], varargs='args', varkw='kwargs', defaults=(1, 2), kwonlyargs=['kk'], kwonlydefaults=None, annotations={})
OrderedDict([('a', <Parameter "a">), ('b', <Parameter "b=1">), ('c', <Parameter "c=2">), ('args', <Parameter "*args">), ('kk', <Parameter "kk">), ('kwargs', <Parameter "**kwargs">)])

a <class 'str'> a <class 'inspect.Parameter'> POSITIONAL_ONLY
b <class 'str'> b=1 <class 'inspect.Parameter'> POSITIONAL_OR_KEYWORD
c <class 'str'> c=2 <class 'inspect.Parameter'> POSITIONAL_OR_KEYWORD
args <class 'str'> *args <class 'inspect.Parameter'> VAR_POSITIONAL
kk <class 'str'> kk <class 'inspect.Parameter'> KEYWORD_ONLY
kwargs <class 'str'> **kwargs <class 'inspect.Parameter'> VAR_KEYWORD

['a', 'b', 'c'] kwargs
['a', 'b', 'c'] kwargs

Enum `format` change

There are some custom classes in azure-cli that define members like Foo.BAR = 'bar'. In 3.11, Enum.__format__() changes (https://docs.python.org/3/whatsnew/3.11.html#enum): it returns the enum and member name (e.g. Color.RED). (The __str__ method is the same as in Python 3.10.)

Changed Enum.__format__() (the default for format(), str.format() and f-strings) to always produce the same result as Enum.__str__(): for enums inheriting from ReprEnum it will be the member’s value; for all other enums it will be the enum and member name (e.g. Color.RED). –What’s New In Python 3.11

from enum import Enum
class Foo(str, Enum):
    BAR = "bar"

# Python 3.10
f"{Foo.BAR}"  # > bar
str(Foo.BAR)  # > Foo.BAR

# Python 3.11
f"{Foo.BAR}"  # > Foo.BAR
str(Foo.BAR)  # > Foo.BAR

The standard replacement for the Foo class is StrEnum:

from enum import StrEnum  # new in Python 3.11

class Foo(StrEnum):
    BAR = "bar"

# Python 3.11
f"{Foo.BAR}"  # > bar

If you also use Bar(int, Enum), you can replace it with ReprEnum: Bar(int, ReprEnum).
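StrEnum only exists from 3.11 on. If the same code has to behave identically on both 3.10 and 3.11, one possible approach (my own sketch, not what the azure-cli PR does) is to keep the (str, Enum) base and override __str__ and __format__ explicitly, so the result no longer depends on the interpreter’s default:

```python
from enum import Enum


class Foo(str, Enum):
    BAR = "bar"

    def __str__(self):
        # Always render the member as its value.
        return str(self.value)

    def __format__(self, spec):
        # f-strings and format() go through __format__.
        return format(self.value, spec)


formatted = f"{Foo.BAR}"
stringified = str(Foo.BAR)
```

Both formatted and stringified are the plain value here, and Foo.BAR still compares equal to the string "bar".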

`unittest.mock`

In bpo-44686 (Pull Request #18544, python/cpython, by graingert), unittest.mock._importer was replaced with pkgutil.resolve_name, which also changes some behavior.

Previously, it used __import__ to import the patch target, which does not check the module name. But pkgutil.resolve_name validates the name first, so mock.patch fails if the target is not a valid dotted Python name. For example, this statement fails in 3.11:

@mock.patch('azure.cli.command_modules.vm.aaz.2020_09_01_hybrid.network.vnet.List', _mock_network_client_with_existing_vnet_location)

as 2020_09_01_hybrid is not a valid identifier in Python (it starts with a digit).

            _NAME_PATTERN = re.compile(f'^(?P<pkg>{dotted_words})'
                                       f'(?P<cln>:(?P<obj>{dotted_words})?)?$',
                                       re.UNICODE)

        m = _NAME_PATTERN.match(name)
        if not m:
>           raise ValueError(f'invalid format: {name!r}')
E           ValueError: invalid format: 'azure.cli.command_modules.vm.aaz.2020_09_01_hybrid.network.vnet'

As a workaround, mock.patch.object works:

from importlib import import_module
from unittest import mock

vnet = import_module('azure.cli.command_modules.vm.aaz.2018_03_01_hybrid.network.vnet')
with mock.patch.object(vnet, 'List', _mock_network_client_with_existing_vnet):
    ...  # test body goes here

The ultimate solution is to fix the module name.
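To try the workaround without azure-cli installed, here is a self-contained sketch. The module name 2020_demo_module and its List attribute are hypothetical; the module is registered in sys.modules by hand so that import_module can find it.

```python
import sys
import types
from importlib import import_module
from unittest import mock

# A fake module whose name starts with a digit, like 2020_09_01_hybrid.
fake = types.ModuleType("2020_demo_module")
fake.List = lambda: "real"
sys.modules["2020_demo_module"] = fake

# import_module accepts any name that is present in sys.modules.
loaded = import_module("2020_demo_module")

# patch.object takes the module object itself, so no dotted name is parsed.
with mock.patch.object(loaded, "List", lambda: "mocked"):
    patched_result = loaded.List()

restored_result = loaded.List()
```

The attribute is replaced only inside the with block and restored afterwards.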

`argparse.ArgumentError`

bpo-39716: Raise on conflicting subparser names (Pull Request #18605, python/cpython, by anntzer). argparse now raises an ArgumentError when the same subparser name is added twice.

import argparse

parser = argparse.ArgumentParser()
t = parser.add_subparsers()
t.add_parser('a')
t.add_parser('a')

The above code works on 3.10 but raises this error in 3.11:

Traceback (most recent call last):
  File "C:\Users\kk\Developer\azure-cli\p.py", line 6, in <module>
    t.add_parser('a')
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1264.0_x64__qbz5n2kfra8p0\Lib\argparse.py", line 1192, in add_parser
    raise ArgumentError(self, _('conflicting subparser: %s') % name)
argparse.ArgumentError: argument {a}: conflicting subparser: a
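If the duplicate registration comes from code you cannot easily deduplicate, a small guard is one option. This is only a sketch: add_parser_once is a hypothetical helper, and it relies on the choices mapping of the object returned by add_subparsers(), which is where argparse keeps the name-to-parser map.

```python
import argparse


def add_parser_once(subparsers, name, **kwargs):
    """Return the existing subparser for name, or create it if missing."""
    existing = subparsers.choices.get(name)
    if existing is not None:
        return existing
    return subparsers.add_parser(name, **kwargs)


parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers()
first = add_parser_once(subparsers, 'a')
second = add_parser_once(subparsers, 'a')  # reuses the first parser instead of raising
```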

Ref

`import` in Python

It’s well known that Python’s import statement is implemented by the __import__ function. In general, if we want to import a module dynamically, we can use the importlib.import_module function, which is a wrapper around __import__.

The most important difference between these two functions is that import_module() returns the specified package or module (e.g. pkg.mod), while __import__() returns the top-level package or module (e.g. pkg). – https://docs.python.org/3/library/importlib.html#importlib.import_module

import itertools and from requests import exceptions can be translated to:

import importlib

itertools = importlib.import_module('itertools')
exceptions = importlib.import_module('requests.exceptions')

`__import__`

This is an advanced function that is not needed in everyday Python programming, unlike importlib.import_module().

Here is an example of how __import__ is called:

old_import = __import__

def noisy_importer(name, globals=None, locals=None, fromlist=None, level=0):
    print(f'name: {name!r}')
    print(f'fromlist: {fromlist}')
    print(f'level: {level}')
    print('-' * 80)
    return old_import(name, globals, locals, fromlist, level)

import builtins
builtins.__import__ = noisy_importer

print('import math')
import math
print('from math import sqrt')
from math import sqrt

>>>
import math
name: 'math'
fromlist: None
level: 0
--------------------------------------------------------------------------------
from math import sqrt
name: 'math'
fromlist: ('sqrt',)
level: 0
--------------------------------------------------------------------------------

As mentioned earlier, __import__ returns the top-level module.

For example, requests = __import__('requests.exceptions', globals(), locals(), [], 0) binds the top-level requests package. If you want to get the submodule exceptions, you need to use getattr: requests_exceptions = getattr(__import__('requests', globals(), locals(), [], 0), 'exceptions').

There is another tricky way to import the submodule: use a non-empty fromlist: requests_exceptions = __import__('requests.exceptions', globals(), locals(), [None], 0).

Additionally, we can also set fromlist to specify the names of submodules that should be imported. The statement from spam.ham import eggs, sausage as saus can be translated to

_temp = __import__('spam.ham', globals(), locals(), ['eggs', 'sausage'], 0)
eggs = _temp.eggs
saus = _temp.sausage
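The same behavior can be checked with a standard-library package, so nothing needs to be installed. This is a small sketch using json.decoder as the example submodule:

```python
import importlib

# Empty fromlist: __import__ returns the top-level package.
top = __import__("json.decoder")

# Non-empty fromlist: __import__ returns the submodule itself.
sub_via_fromlist = __import__("json.decoder", fromlist=["JSONDecoder"])

# import_module always returns the module that was named.
sub_via_importlib = importlib.import_module("json.decoder")
```

Here top is the json package, while the other two names refer to the same json.decoder module object.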

Skip importing non-existing modules with `__import__`

This is a use case for the __import__ function. Some packages are missing, but we want to make sure that the code does not crash when importing them.

import builtins
from unittest.mock import Mock
old_import = __import__

def skip_imports(name, globals=None, locals=None, fromlist=None, level=0):
    skip_list = {'urllib3', 'requests_oauthlib', 'cryptography'}
    if name in skip_list or any(name.startswith(f'{p}.') for p in skip_list):
        return Mock()
    else:
        return old_import(name, globals, locals, fromlist, level)

builtins.__import__ = skip_imports

Ref:

`sys.path` in Python

Here is how sys.path is set in Python, with some parts omitted.

Python Command Line Arguments

By default, a potentially unsafe path is prepended to sys.path when it is initialized at program startup:

python -m: prepend the current working directory.

python script.py: prepend the script’s directory. If the script is a symbolic link, the link is resolved first.

python -c and python (REPL): prepend an empty string, which means the current working directory.

You can prevent this path from being prepended with the -P option (added in Python 3.11).
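The python -c case is easy to observe from a script. The sketch below starts a child interpreter and prints its first sys.path entry; the helper name is made up, and PYTHONSAFEPATH is removed from the child’s environment on the assumption that it would otherwise suppress the entry, just like -P.

```python
import os
import subprocess
import sys


def first_path_entry_for_dash_c():
    """Return repr(sys.path[0]) as seen by a child started with `python -c`."""
    env = dict(os.environ)
    env.pop("PYTHONSAFEPATH", None)  # has the same effect as -P when set
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(repr(sys.path[0]))"],
        capture_output=True,
        text=True,
        check=True,
        env=env,
    )
    return result.stdout.strip()
```

On a regular installation this returns the two-character string '' (an empty string literal), which stands for the current working directory.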

`PYTHONPATH`

If this environment variable is set, the folders in it will be added to sys.path. The folders are separated by colons on Unix and semicolons on Windows.

`prefix` and `exec_prefix`

These two variables define where the standard Python modules and the extension modules are located. Python searches specific paths depending on the OS. The starting point is the Python executable’s path, which is called home (symbolic links are followed).

Once home is determined, the prefix directory is found by looking for python<majorversion><minorversion>.zip, for example python312.zip. On Windows, the zip file is in the same directory as the Python executable. On Unix, it is in the lib folder. If it is not found, Python looks for Lib\os.py on Windows and for lib/python3.12/os.py on Unix.

On my Mac, the home is /opt/homebrew/opt/python@3.12/Frameworks/Python.framework/Versions/3.12/bin/python3.12. The prefix is /opt/homebrew/opt/python@3.12/Frameworks/Python.framework/Versions/3.12, because lib/python3.12/os.py is there.

On Windows, the exec_prefix is the same as prefix. On other OSes, exec_prefix is determined by looking for lib/python3.xx/lib-dynload. On my Mac, it’s still /opt/homebrew/opt/python@3.12/Frameworks/Python.framework/Versions/3.12.

lib/python312.zip, lib/python3.12 and lib/python3.12/lib-dynload are added into sys.path.

`site` module

This module is imported automatically during Python startup and tries to append the site-packages folder to sys.path. It can be disabled with the -S option.

Finding the site-packages folder is easy: it is derived from prefix and exec_prefix. These two paths are the head, and the tail is lib/site-packages on Windows or lib/pythonX.Y/site-packages on *nix. For each head-tail combination, the path is added to sys.path if it exists.
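The head-tail rule above can be sketched as a small pure function. This is a simplification for illustration, not the site module’s actual code: candidate_site_packages is a hypothetical name, and the existence check is left out so that the result does not depend on the machine.

```python
import ntpath
import posixpath


def candidate_site_packages(heads, version, windows):
    """Combine each head (prefix / exec_prefix) with the platform's tail."""
    candidates = []
    for head in heads:
        if windows:
            path = ntpath.join(head, "Lib", "site-packages")
        else:
            tail = "python{}.{}".format(version[0], version[1])
            path = posixpath.join(head, "lib", tail, "site-packages")
        # prefix and exec_prefix are often identical, so skip duplicates.
        if path not in candidates:
            candidates.append(path)
    return candidates
```

For example, candidate_site_packages(["/usr", "/usr"], (3, 12), windows=False) yields a single entry because both heads are the same. The real list on your machine is available from site.getsitepackages().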

.pth files

If a name.pth file exists in the site-packages folder, its contents are additional items to be added to sys.path, one per line. A relative path is resolved against the folder that contains the .pth file.
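You can watch a .pth file being processed without touching the real site-packages folder: site.addsitedir() applies the same processing to any directory. The sketch below uses a temporary directory; the names demo.pth and extra are made up.

```python
import os
import site
import sys
import tempfile

# A throwaway folder that plays the role of site-packages.
site_dir = tempfile.mkdtemp()
extra_dir = os.path.join(site_dir, "extra")
os.mkdir(extra_dir)

# One path per line, relative to the folder that holds the .pth file.
with open(os.path.join(site_dir, "demo.pth"), "w") as f:
    f.write("extra\n")

# Adds site_dir to sys.path and then processes its .pth files.
site.addsitedir(site_dir)

extra_added = os.path.abspath(extra_dir) in sys.path
```

Note that this changes sys.path for the rest of the process, and that lines pointing to non-existing folders are skipped.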

The site module also tries to add the USER_SITE folder to sys.path. The default value is ~/.local/lib/pythonX.Y/site-packages for UNIX and non-framework macOS builds, ~/Library/Python/X.Y/lib/python/site-packages for macOS framework builds, and %APPDATA%\Python\PythonXY\site-packages on Windows.

Example

We can use python3 -m site to quickly check sys.path and user site. Here is the output on my mac:

sys.path = [
    '{current folder}',
    '/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python312.zip',
    '/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12',
    '/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/lib-dynload',
    '/opt/homebrew/lib/python3.12/site-packages',
    '/opt/homebrew/opt/python@3.12/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages',
]
USER_BASE: '/Users/kk/Library/Python/3.12' (doesn't exist)
USER_SITE: '/Users/kk/Library/Python/3.12/lib/python/site-packages' (doesn't exist)
ENABLE_USER_SITE: True

The paths are slightly different from what the documentation states: the site-packages folder is not under prefix. I guess that is because Homebrew creates lots of symbolic links. python3 is /opt/homebrew/bin/python3 -> opt/homebrew/Cellar/python@3.12/3.12.4/bin/python3 -> /opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/bin/python3.

>>> import sys
>>> sys.prefix
'/opt/homebrew/opt/python@3.12/Frameworks/Python.framework/Versions/3.12'
>>> sys.exec_prefix
'/opt/homebrew/opt/python@3.12/Frameworks/Python.framework/Versions/3.12'

Here is the output on my Ubuntu server:

sys.path = [
    '{current folder}',
    '/usr/lib/python38.zip',
    '/usr/lib/python3.8',
    '/usr/lib/python3.8/lib-dynload',
    '/home/kk/.local/lib/python3.8/site-packages',
    '/usr/local/lib/python3.8/dist-packages',
    '/usr/local/lib/python3.8/dist-packages/cloud_init-20.1-py3.8.egg',
    '/usr/lib/python3/dist-packages',
]
USER_BASE: '/home/kk/.local' (exists)
USER_SITE: '/home/kk/.local/lib/python3.8/site-packages' (exists)
ENABLE_USER_SITE: True

The documentation also does not explain why `/opt/homebrew/lib/python3.12/site-packages` is in the path on my Mac. The documentation is somewhat out of date: https://discuss.python.org/t/the-document-on-pythonhome-might-be-wrong/19614

Ref:

Modern pip build process (--use-pep517)

Nowadays, pyproject.toml has become the standard configuration file for packaging. Compared with the old setup.py, it brings two features: PEP 517 and PEP 518.

PEP 517 defines two hooks, build_wheel and build_sdist, which are required to build the package from source. Each build backend must implement these two hooks. This makes it possible to create other build backends such as flit or poetry.

[build-system]
# Defined by PEP 518:
requires = ["flit"]
# Defined by this PEP:
build-backend = "local_backend"
backend-path = ["backend"]
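The build-backend value is a module path with an optional :object suffix, and a frontend has to turn it into a Python object before it can call the hooks. Here is a minimal sketch of that lookup. load_backend is a hypothetical helper, not pip’s implementation, and the example strings below point at standard-library objects only so that the snippet runs without installing a backend.

```python
import importlib


def load_backend(build_backend):
    """Resolve a 'module.path' or 'module.path:object.attr' string."""
    module_path, _, object_path = build_backend.partition(":")
    obj = importlib.import_module(module_path)
    if object_path:
        # Walk down the attribute chain after the colon.
        for attr in object_path.split("."):
            obj = getattr(obj, attr)
    return obj


# Stand-ins for a real backend string such as "local_backend".
module_backend = load_backend("json.decoder")
object_backend = load_backend("json.decoder:JSONDecoder")
```

A real frontend would then call hooks such as build_wheel on the returned object, inside an environment where the requires list has been installed.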

Besides setuptools, there are other build backends such as hatchling and flit. You can find examples here: Python Packaging User Guide - Choosing a build backend

PEP 518 defines the format of pyproject.toml, where you can specify your build dependencies, ensuring the necessary tools are installed when building the project. For example:

[build-system]
requires = ["setuptools ~= 58.0", "cython ~= 0.29.0"]

Is `setup.py` deprecated?

According to the Python packaging documentation, it is still a valid configuration file for setuptools, but invoking setup.py as a command is deprecated:

Deprecated	Replacement
python setup.py install	pip install .
python setup.py develop	pip install -e .
python setup.py sdist	python -m build
python setup.py bdist_wheel	python -m build

Build is a PEP 517 compatible build frontend. It calls the build backend to generate the source and wheel distributions, and it’s the recommended way to build a package.

When is `--use-pep517` activated?

If the source distribution contains pyproject.toml, pip will use PEP 517 to build the package.

If the current env does not have setuptools or wheel, pip will use PEP 517 to build the package: pip source code

You can also force pip to use PEP 517 with --use-pep517, or disable it and use the legacy behavior with --no-use-pep517.
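The three rules above can be condensed into a tiny decision function. This is my own simplification for illustration; should_use_pep517 and its parameters are hypothetical names, and pip’s real logic handles more cases (for example, it inspects the contents of pyproject.toml).

```python
def should_use_pep517(has_pyproject, has_setuptools, has_wheel, use_pep517=None):
    """Simplified sketch of when pip builds through PEP 517.

    use_pep517 mirrors the command-line flags: True for --use-pep517,
    False for --no-use-pep517, None when neither flag is given.
    """
    if use_pep517 is not None:
        return use_pep517
    # Without a flag: pyproject.toml, or a missing setuptools/wheel, triggers PEP 517.
    return has_pyproject or not (has_setuptools and has_wheel)
```

For instance, a project without pyproject.toml in an env that lacks wheel still goes through PEP 517, which is exactly the situation described in the issue below.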

What’s the difference between `--use-pep517` and legacy behavior?

This is a typical log which uses --use-pep517:

  root@427314aff523:/# pip install requests --no-binary :all: --no-deps
Collecting requests
  Using cached requests-2.32.3.tar.gz (131 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: requests
  Building wheel for requests (pyproject.toml) ... done
  Created wheel for requests: filename=requests-2.32.3-py3-none-any.whl size=64922 sha256=9ee1e853d3d86a8b484cf10c2920601befe81bfad4bd0c3319274b67143ac266

This is the one which uses the legacy behavior:

  Processing d:\a\_work\1\s\src\azure-cli
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: azure-cli
  Building wheel for azure-cli (setup.py): started
  Building wheel for azure-cli (setup.py): finished with status 'done'

The main difference is that --use-pep517 creates a temporary build env and builds the package in it. The build env is completely fresh, so pip first has to install the build dependencies such as setuptools, then calls the backend to build the wheel package. Finally, it installs the wheel package.

In the legacy behavior, pip uses the current env’s pip, setuptools and wheel to build the package with python setup.py bdist_wheel, then installs the wheel package.

A weird issue related to `--use-pep517`

When bumping Azure CLI to Python 3.12, get-pip.py no longer installs setuptools or wheel by default, so pip tries to use PEP 517 to build azure-cli.

However, the runner agent uses Python 3.12.6 and the embedded Python is 3.12.7, and they have a compatibility issue. In the build env, python -m pip fails with code 57005, because it tries to load modules from 3.12.6. I have to use python -Im pip to install the package. However, in the build env, the -I option is not honored, so the command still fails. I’ve created an issue in the pip repo. The workaround is too complicated, so I install setuptools and wheel to make pip use the legacy behavior, which uses the env’s pip to build the package.

The details can be found in the PR.

Ref:

Namespace Package in Python

Recently, a GitHub issue about namespace packages was opened in Azure CLI. I think it is a good time to write down what I know about namespace packages.

What is Namespace Package

If several packages share the same root folder, then the root folder is a namespace package. subpackage_a and subpackage_b can be installed separately, even in different Python paths, but they are imported as parts of a single package: import root.

How to create namespace Package

There are three ways to create a namespace package in Python; you can find the details in Packaging namespace packages.

Native namespace packages

You don’t need to create __init__.py in the root folder; use include = ["mynamespace.subpackage_a"] in pyproject.toml.

After installation, the root folder does not contain __init__.py, so Python treats it as an implicit namespace package because of PEP 420.
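A quick way to see a native namespace package in action is to build one in temporary folders. In this sketch the names demo_ns_pkg, sub_a and sub_b are made up, and the two temporary directories stand in for two separate install locations.

```python
import importlib
import os
import sys
import tempfile


def make_subpackage(base, namespace, name):
    """Create base/namespace/name/__init__.py with no __init__.py in namespace."""
    pkg_dir = os.path.join(base, namespace, name)
    os.makedirs(pkg_dir)
    with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
        f.write("NAME = {!r}\n".format(name))


dir_a = tempfile.mkdtemp()
dir_b = tempfile.mkdtemp()
make_subpackage(dir_a, "demo_ns_pkg", "sub_a")
make_subpackage(dir_b, "demo_ns_pkg", "sub_b")

sys.path[:0] = [dir_a, dir_b]
importlib.invalidate_caches()

sub_a = importlib.import_module("demo_ns_pkg.sub_a")
sub_b = importlib.import_module("demo_ns_pkg.sub_b")

# The namespace package has no __init__.py, so it has no __file__;
# its __path__ lists one portion per install location.
namespace = sys.modules["demo_ns_pkg"]
portions = list(namespace.__path__)
```

Both subpackages import fine even though they live in different folders, and portions contains the two demo_ns_pkg directories.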

Legacy method: pkgutil-style namespace packages

The only reason to use this method is to support Python 2. You need to create this __init__.py in the root folder:

__path__ = __import__('pkgutil').extend_path(__path__, __name__)

After installation, the __init__.py is also kept in the root folder.

Legacy method: pkg_resources-style namespace packages

This method relies on setuptools. Since Python 3.12, setuptools is not installed by default, and pkg_resources is now also deprecated in setuptools. Currently, it shows this warning:

UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.

So I don’t recommend using this method.

You need to create this __init__.py in the root folder, and declare the namespace package in setup.py: namespace_packages=['mynamespace'].

__import__('pkg_resources').declare_namespace(__name__)

After installation, the __init__.py is not kept in the root folder.

If you forget to declare the namespace package in setup.py, __init__.py is kept in the root folder, and the root folder becomes a normal package. You will run into trouble if you install subpackages in different Python paths.

Difference between these three styles

The main difference is whether there is an __init__.py in the root folder after installation. If there is, it is imported as a regular package; otherwise it is a namespace package. The pkgutil style keeps an __init__.py in the root folder.

Another difference is that a namespace package is updated when you change sys.path, while a regular package is not. For example, if you update sys.path and there is a new subpackage in the added path, the namespace package will include the new subpackage. But a regular package will not be updated, and you will get ModuleNotFoundError when you try to import the new subpackage.
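The dynamic update is easy to reproduce with a native namespace package. In this sketch (demo_dyn_ns, early and late are made-up names), the second folder is added to sys.path only after the namespace package has already been imported:

```python
import importlib
import os
import sys
import tempfile


def make_subpackage(base, namespace, name):
    """Create base/namespace/name/__init__.py with no __init__.py in namespace."""
    pkg_dir = os.path.join(base, namespace, name)
    os.makedirs(pkg_dir)
    with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
        f.write("NAME = {!r}\n".format(name))


# First location: import the namespace package through it.
first_dir = tempfile.mkdtemp()
make_subpackage(first_dir, "demo_dyn_ns", "early")
sys.path.append(first_dir)
importlib.invalidate_caches()
early = importlib.import_module("demo_dyn_ns.early")
paths_before = list(sys.modules["demo_dyn_ns"].__path__)

# Second location: added to sys.path after the namespace was imported.
second_dir = tempfile.mkdtemp()
make_subpackage(second_dir, "demo_dyn_ns", "late")
sys.path.append(second_dir)
importlib.invalidate_caches()
late = importlib.import_module("demo_dyn_ns.late")
paths_after = list(sys.modules["demo_dyn_ns"].__path__)
```

The namespace package’s __path__ grows from one portion to two, so demo_dyn_ns.late is found. With an __init__.py in the first demo_dyn_ns folder, the second import would fail instead.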

I encountered this issue in the development of Azure CLI. azure-cli and azure-cli-core still use the pkgutil-style namespace package. After installing them in editable mode, the extensions’ azure-xxxx dependencies fail to load, because there is an __init__.py in the source code and the azure folder is treated as a regular package. The extension dependencies are added to sys.path after azure is imported, so the new azure-xxxx package is ignored. (Maybe deleting azure from sys.modules and importing it again can fix this, but I haven’t tried it.)

Mixed use of different style namespace packages

You should avoid mixing different styles of namespace packages, but if you’re working with legacy code, you may encounter this situation. (It gets even worse when some packages are packaged incorrectly. For example, they forget to declare the namespace package in setup.py for pkg_resources-style namespace packages, so the root folder becomes a normal package.)

Let’s look at Python’s import mechanism first:

If <directory>/foo/__init__.py is found, a regular package is imported and returned.

If not, but <directory>/foo.{py,pyc,so,pyd} is found, a module is imported and returned. The exact list of extension varies by platform and whether the -O flag is specified. The list here is representative.

If not, but <directory>/foo is found and is a directory, it is recorded and the scan continues with the next directory in the parent path. Otherwise the scan continues with the next directory in the parent path.

–PEP 420 Specification

As you can see, if the __init__.py is found, the process stops and a regular package is imported. When you use the pkgutil style, it will also returns a normal package, even though it has tries to import the packages from all sys.path.

For example, if you use the pypa sample-namespace-packages to test the namespace package. In the pkgutil style package, import example_pkg returns a regular package: <module 'example_pkg' from '/Users/kk/Developer/sample-namespace-packages/pkgutil/pkg_a/example_pkg/__init__.py'>. In other two styles, it’s a namespace package: <<module 'example_pkg' (namespace) from ['/Users/kk/Developer/sample-namespace-packages/native/pkg_a/example_pkg']>.

So it’s okay to mix these three styles, as all subpackages should still be imported. But if an invalid package (one that follows none of these three styles and is just a normal package) is also installed, the import might be interrupted by the invalid package.

If you use the native style or the pkg_resources style but there is a normal __init__.py somewhere in sys.path, Python only returns the normal package and ignores the namespace package, even though the namespace package has been imported. That’s why azure.cli is not found in issue #31843.

If a normal package and a pkgutil-style namespace package are both installed, the result depends on the load order: if the pkgutil package loads first, all subpackages are imported; if the normal package loads first, only the normal package is returned.

In conclusion, if you want to use namespace packages, use the native style and make sure each subpackage is packaged correctly. Otherwise, you may encounter module-not-found errors.
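The shadowing described above can be reproduced in a small, self-contained sketch. The package name demo_ns, the module names and the temporary directory layout are all made up for illustration; the helper importable is hypothetical as well. It builds two native namespace portions, checks that both import, then adds a regular package with the same name and checks that the namespace portions are no longer reachable.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_portion(root: Path, pkg: str, module: str, regular: bool = False) -> str:
    """Create <root>/<pkg>/<module>.py; add __init__.py only if `regular`."""
    pkg_dir = root / pkg
    pkg_dir.mkdir(parents=True)
    (pkg_dir / f"{module}.py").write_text(f"NAME = {module!r}\n")
    if regular:
        (pkg_dir / "__init__.py").write_text("")
    return str(root)


def importable(name: str) -> bool:
    """Try a fresh import of `name`, dropping cached demo_ns modules first."""
    for key in [k for k in sys.modules if k == "demo_ns" or k.startswith("demo_ns.")]:
        del sys.modules[key]
    importlib.invalidate_caches()
    try:
        importlib.import_module(name)
        return True
    except ModuleNotFoundError:
        return False


tmp = Path(tempfile.mkdtemp())
dir_a = make_portion(tmp / "a", "demo_ns", "mod_a")                 # native portion
dir_b = make_portion(tmp / "b", "demo_ns", "mod_b")                 # native portion
dir_c = make_portion(tmp / "c", "demo_ns", "mod_c", regular=True)   # regular package

# Only the two native portions are on sys.path: both submodules import.
sys.path[:0] = [dir_a, dir_b]
native_ok = importable("demo_ns.mod_a") and importable("demo_ns.mod_b")
native_file = getattr(sys.modules["demo_ns"], "__file__", None)

# A regular package with the same name wins, even at the end of sys.path.
sys.path.append(dir_c)
shadowed_a = importable("demo_ns.mod_a")
shadowed_c = importable("demo_ns.mod_c")

print(native_ok, native_file, shadowed_a, shadowed_c)
```

The regular package is appended at the end of sys.path on purpose: for native namespace packages the position does not matter, because a directory without __init__.py is only recorded while the scan continues.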

Misc

Some Useful Shell Tools

Here are some shell tools I use, which can boost your productivity. Modern-unix is a great repo that lists lots of modern Unix tools.

Prezto

A zsh configuration framework. It provides auto-completion, prompt themes and lots of modules that work with other useful tools. I really love the agnoster theme.

Fasd

Helps you navigate between folders and launch applications.

Here are the official usage examples:

v def conf       =>     vim /some/awkward/path/to/type/default.conf
j abc            =>     cd /hell/of/a/awkward/path/to/get/to/abcdef
m movie          =>     mplayer /whatever/whatever/whatever/awesome_movie.mp4
o eng paper      =>     xdg-open /you/dont/remember/where/english_paper.pdf
vim `f rc lo`    =>     vim /etc/rc.local
vim `f rc conf`  =>     vim /etc/rc.conf

pt

A fast code search tool similar to ack.

fzf

A great fuzzy finder. It can also integrate with vim through fzf.vim.

thefuck

Magnificent app which corrects your previous console command.

tldr

More concise and user-friendly man pages. (This screenshot uses powerlevel10k theme)

ripgrep

Another search tool. Use rg -. to include hidden files.

fd

A user-friendly alternative to find. It ignores hidden files and files matched by .gitignore by default.

For example, fd -H 'flac$' searches for all files whose names end with flac, including hidden ones.

bat

Similar to cat with syntax highlighting and git integration.

Zim

A fast Zsh framework. You can use an OMZ plugin like this:

export ZSH_CACHE_DIR=~/.cache

zmodule ohmyzsh/ohmyzsh --use degit --source 'plugins/fasd/fasd.plugin.zsh'

—

update 18/11/19 Add tldr. The powerlevel10k theme is a fancy ZSH theme.
update 29/12/21 Add rg, bat, fd
update 06/01/22 Add Zim
update 01/04/24 Add maintained-modern-unix repo

Start

Over the years, I have read so many programmers’ blogs, which have helped me a lot. Now I think it’s time to start my own blog.

I hope this pushes me to review what I have learned, and it would be even better if someone else can benefit from it.

Preview LaTeX in Org Mode with Emacs on macOS

Using the right Emacs Version

I failed to preview LaTeX with emacs-plus. If you have installed d12frosted/emacs-plus, uninstall it and use emacs-mac.

brew tap railwaycat/emacsmacport
brew install emacs-mac

If you like the fancy spacemacs icon, install it as a cask: brew install --cask emacs-mac-spacemacs-icon (older Homebrew versions used brew cask install emacs-mac-spacemacs-icon).

Install TeX

Download and install BasicTeX.pkg here.
Add /Library/TeX/texbin to PATH.
Install dvisvgm by sudo tlmgr update --self && sudo tlmgr install dvisvgm collection-fontsrecommended

Emacs settings

Add TeX related bin to path: (setenv "PATH" (concat (getenv "PATH") ":/Library/TeX/texbin"))
Tell Org Mode to create svg images: (setq org-latex-create-formula-image-program 'dvisvgm)

Now you can see the rendered LaTeX equation by calling org-preview-latex-fragment or using shortcut ,Tx.

If you want to load LaTeX previews automatically at startup, add this at the beginning of org file: #+STARTUP: latexpreview.

—

update 31-07-19
_ and ... are not displayed in Emacs because some fonts are missing. tlmgr install collection-fontsrecommended should fix this.

If the Org Preview LaTeX buffer shows the warning processing of PostScript specials is disabled (Ghostscript not found), run brew install ghostscript.

Build Your Own Tiny Tiny RSS Service

After Inoreader changed its free plan, which limits the maximum number of subscriptions to 150, I began looking for an alternative. Finally, I found Tiny Tiny RSS. It has a nice website and the Fever API plugin, which is supported by most RSS reader apps, so you can read RSS on all of your devices.

This post will tell you how to deploy it on your server.

Prerequisite

You need to install Docker and Docker Compose before using docker-compose.yml

Install docker

Make a new ttrss folder, create docker-compose.yml with this content:

version: "3"
services:
  database.postgres:
    image: sameersbn/postgresql:latest
    container_name: postgres
    environment:
      - PG_PASSWORD=PWD # please change the password
      - DB_EXTENSION=pg_trgm
    volumes:
      - ~/postgres/data/:/var/lib/postgresql/ # persist postgres data to ~/postgres/data/ on the host
    ports:
      - 5433:5432
    restart: always

  service.rss:
    image: wangqiru/ttrss:latest
    container_name: ttrss
    ports:
      - 181:80
    environment:
      - SELF_URL_PATH=https://RSS.com/ # please change to your own domain
      - DB_HOST=database.postgres
      - DB_PORT=5432
      - DB_NAME=ttrss
      - DB_USER=postgres
      - DB_PASS=PWD # please change the password
      - ENABLE_PLUGINS=auth_internal,fever,api_newsplus # auth_internal is required. Plugins enabled here will be enabled for all users as system plugins
      - SESSION_COOKIE_LIFETIME=8760
    stdin_open: true
    tty: true
    restart: always
    command: sh -c 'sh /wait-for.sh database.postgres:5432 -- php /configure-db.php && exec s6-svscan /etc/s6/'

  service.mercury: # set Mercury Parser API endpoint to =service.mercury:3000= on TTRSS plugin setting page
    image: wangqiru/mercury-parser-api:latest
    container_name: mercury
    expose:
      - 3000
    ports:
      - 3000:3000
    restart: always

Run this command to deploy: docker-compose up -d. After it finishes, the TTRSS service is running on port 181. The default account is admin with the password password.

I made minor modifications to the yml file. You can find the latest file here.

Nginx Configuration

If you have a domain, you can use Nginx as a reverse proxy to serve TTRSS under that domain.

upstream ttrssdev {
    server 127.0.0.1:181;
}

server {
    listen 80;
    server_name  RSS.com;
    return 301 https://RSS.com$request_uri;
}

server {
    listen 443 ssl;
    gzip on;
    server_name  RSS.com;


    access_log /var/log/nginx/ttrssdev_access.log combined;
    error_log  /var/log/nginx/ttrssdev_error.log;

    location / {
        proxy_redirect off;
        proxy_pass http://ttrssdev;

        proxy_set_header  Host                $http_host;
        proxy_set_header  X-Real-IP           $remote_addr;
        proxy_set_header  X-Forwarded-Ssl     on;
        proxy_set_header  X-Forwarded-For     $proxy_add_x_forwarded_for;
        proxy_set_header  X-Forwarded-Proto   $scheme;
        proxy_set_header  X-Frame-Options     SAMEORIGIN;

        client_max_body_size        100m;
        client_body_buffer_size     128k;

        proxy_buffer_size           4k;
        proxy_buffers               4 32k;
        proxy_busy_buffers_size     64k;
        proxy_temp_file_write_size  64k;
    }
    ssl_certificate /etc/letsencrypt/live/rss.fromkk.com/fullchain.pem; # managed by Certbot
    ssl_certificate_key /etc/letsencrypt/live/rss.fromkk.com/privkey.pem; # managed by Certbot

}

To enable HTTPS on your website, you can use certbot.

Caddy Configuration

Update in 22/12/2021

I found that Caddy 2 is much easier to use than Nginx. All you need to do is add a few lines to `/etc/caddy/Caddyfile`:

rss.com {
encode gzip zstd
reverse_proxy  127.0.0.1:181
}

Voila, an HTTPS-enabled website is deployed.

Fever API and Mercury

Fever
1. Check Enable API: Allows accessing this account through the API in Preferences.
2. Enter a new password for Fever in Plugins - Fever Emulation.
Mercury Fulltext Extraction
1. Check the mercury_fulltext plugin in Preferences - Plugins.
2. Set the Mercury Parser API address to service.mercury:3000 in Feeds - Mercury Fulltext settings.

Update

Simply run these commands to update TTRSS:

docker-compose pull
docker-compose up -d

App recommendation

Reeder 4 works great on my iPad. It’s smooth and fast, and is worth every penny.

If you want a free app, I suggest Fiery Feeds. I stopped using it after version 2.2 because it was so laggy. If that issue has been fixed, I think it is the biggest competitor to Reeder 4. For more alternatives, read this article: The Best RSS App for iPhone and iPad.

update 25-03-20:

You can find the latest document here.

Ref

A ttrss setup guide - Start your own RSS aggregator today

Jaeger Code Structure

Here is the main logic of the Jaeger agent and the Jaeger collector (based on Jaeger 1.13.1).

Jaeger Agent

Collects UDP packets from port 6831, converts them to model.Span, and sends them to the collector over gRPC.

Jaeger Collector

Processes spans received over gRPC, or packets from Zipkin (port 9411).

Jaeger Query

Listens for gRPC and HTTP requests on port 16686.

Time boundary in InfluxDB Group by Time Statement

These days I use InfluxDB to save some time series data. I love these features it provides:

High Performance

According to its hardware guide, a single node supports more than 750k point writes per second, 100 moderate queries per second and 10M series cardinality.

Continuous Queries

Simple aggregation can be done by InfluxDB’s continuous queries.

Overwrite Duplicated Points

If you submit a new point with the same measurement, tag set and timestamp, the new data will overwrite the old one.

Preset Time Boundary

InfluxDB is well documented, but the group by time section is not very clear. It says it groups data by a preset time boundary, but the example it uses is too simple and doesn’t explain this very well.

In the official example, when using group by time(12m), the time boundaries are 00:12, 00:24. When using group by time(30m), the time boundaries become 00:00, 00:30. It seems that the time boundaries start from the nearest hour plus a multiple of the time interval, but that’s not correct. If you use group by time(7m), the returned time boundaries are not 00:07, 00:14.

Here is an example:

If the data is:

{'time': '2020-01-01T00:02:00Z', 'value': 10}
{'time': '2020-01-01T00:04:00Z', 'value': 8}
{'time': '2020-01-01T00:05:00Z', 'value': 21}
{'time': '2020-01-01T00:07:00Z', 'value': 33}
{'time': '2020-01-02T00:05:00Z', 'value': 9}
{'time': '2020-01-03T10:05:00Z', 'value': 4}

Executing select sum(value) from data where time>='2020-01-01 00:00:00' and time<'2020-01-04 00:00:00' group by time(7m) fill(none) outputs:

{'time': '2019-12-31T23:58:00Z', 'sum': 18}
{'time': '2020-01-01T00:05:00Z', 'sum': 54}
{'time': '2020-01-02T00:00:00Z', 'sum': 9}
{'time': '2020-01-03T10:04:00Z', 'sum': 4}

Note that the time boundary begins at 12-31 23:58, not 01-01 00:00. What causes this?

InfluxDB uses timestamp 0 (1970-01-01T00:00:00Z) as the start time, and it creates a boundary at every timestamp that is divisible by the group by interval. So in this query, the boundaries are timestamp 0, timestamp 420, timestamp 840 and so on. 2019-12-31 23:58:00 converts to timestamp 1577836680, which is divisible by 420, so it is the nearest time boundary before the first data point.

When you use group by time(1w), you will meet the same problem: the result time begins on Thursday rather than Monday, because 1970-01-01 was a Thursday.

So when you use a group by time statement, you’d better use intervals such as 30s, 1m, 5m or 10m, which are factors of 1h, so the results always begin at xx:00.

Sometimes you want to calculate the sum of the most recent 5 minutes of data every minute. With group by time(5m), you only get one result every 5 minutes. To achieve this, you can use the offset parameter of the group by time statement. For example, group by time(5m,1m) moves the time boundaries 1 minute forward, so the results are at xx:01, xx:06 and so on. You can create 5 continuous queries with offsets from 0 to 4.
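The boundary rule can be checked with a few lines of Python. This is only a sketch of the rule as described in this post (epoch-aligned buckets plus an optional offset), not InfluxDB’s implementation; bucket_start and utc are hypothetical helper names. It regroups the sample points from the 7m example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bucket_start(ts, interval, offset=timedelta(0)):
    """Return the start of the group by time() bucket that contains `ts`.

    Buckets are aligned to timestamp 0 (shifted by `offset`), not to the
    start of the query's time range.
    """
    elapsed = ts - EPOCH - offset
    return EPOCH + offset + (elapsed // interval) * interval


def utc(text):
    """Parse a timestamp such as 2020-01-01T00:02:00Z as a UTC datetime."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


points = {
    "2020-01-01T00:02:00Z": 10,
    "2020-01-01T00:04:00Z": 8,
    "2020-01-01T00:05:00Z": 21,
    "2020-01-01T00:07:00Z": 33,
    "2020-01-02T00:05:00Z": 9,
    "2020-01-03T10:05:00Z": 4,
}

# Equivalent of: select sum(value) ... group by time(7m) fill(none)
sums = {}
for time_text, value in points.items():
    start = bucket_start(utc(time_text), timedelta(minutes=7))
    sums[start] = sums.get(start, 0) + value

for start, total in sorted(sums.items()):
    print(start.strftime("%Y-%m-%dT%H:%M:%SZ"), total)

# group by time(1w): buckets start on the weekday of 1970-01-01.
weekly_start = bucket_start(utc("2020-01-01T00:00:00Z"), timedelta(weeks=1))

# group by time(5m,1m): the boundaries move from xx:00, xx:05 to xx:01, xx:06.
shifted_start = bucket_start(
    utc("2020-01-01T00:03:00Z"), timedelta(minutes=5), timedelta(minutes=1)
)
```

The printed buckets match the query output shown earlier, and weekly_start.weekday() is 3, which is Thursday in Python’s numbering.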

More example can be found in this repo.

Group by in Continuous Queries

According to the official resample document, resample every <interval> for <interval> can override the continuous query’s execution interval and the time range of the query statement.

In the examples in the official document, the interval is always a multiple of group by time(m). I tried different values, and here are the results.

Every Interval

every interval can be any value regardless of group by time interval. The CQ will execute at the time boundary of every interval.

For Interval

for interval must be greater than or equal to group by time(xx). If it is less than the group by interval, influx raises an error like this: ERR: error parsing query: FOR duration must be >= GROUP BY time duration: must be a minimum of 20s, got 5s

Start Time and End Time in CQs

Here is a simple example: every 10s for 45s group by time(20s)

execute time	selected start time	selected end time	real start time	real end time
16:00:30	15:59:45	16:00:30	16:00:00	16:00:40
16:00:40	15:59:55	16:00:40	16:00:00	16:00:40
16:00:50	16:00:05	16:00:50	16:00:20	16:01:00
16:01:00	16:00:15	16:01:00	16:00:20	16:01:00

We can see that the execution interval is always 10s, but the start time and end time in the CQ are not equal to now()-45s and now(). They are still based on the group by time boundaries, but the start time must be >= the selected start time and the end time is also >= the selected end time.

Here is another example: every 5s for 10s group by time(10s)

execute time	selected start time	selected end time	real start time	real end time
16:00:00	15:59:50	16:00:00	15:59:50	16:00:00
16:00:05	15:59:55	16:00:05	16:00:00	16:00:10
16:00:10	16:00:00	16:00:10	16:00:00	16:00:10
16:00:15	16:00:05	16:00:15	16:00:10	16:00:20

I guess the reason the start time is always >= the selected start time is to avoid polluting previous data: if a window only partially covered an older bucket, the incomplete aggregate would overwrite the correct value generated before. If there is not enough data in the bucket at the end time, it will be corrected in a later run.
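The two tables are consistent with a simple rule: round both the selected start time and the selected end time up to the next group by time boundary. The sketch below encodes that rule; it is inferred from the observations in this post, not taken from InfluxDB’s source, and ceil_to_boundary, observed_window and at are hypothetical names. The date 2020-01-01 is arbitrary.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ceil_to_boundary(ts, interval):
    """Round `ts` up to the next epoch-aligned group by time() boundary."""
    # Floor division of a negative timedelta rounds away from zero,
    # which turns the floor into a ceiling for `ts`.
    return EPOCH - ((EPOCH - ts) // interval) * interval


def observed_window(execute_time, for_interval, group_by):
    """Return the (real start, real end) pair seen in the tables above."""
    selected_start = execute_time - for_interval
    return (
        ceil_to_boundary(selected_start, group_by),
        ceil_to_boundary(execute_time, group_by),
    )


def at(hour, minute, second):
    """Build a UTC timestamp on an arbitrary day for the demo."""
    return datetime(2020, 1, 1, hour, minute, second, tzinfo=timezone.utc)


# every 10s for 45s group by time(20s)
first = observed_window(at(16, 0, 30), timedelta(seconds=45), timedelta(seconds=20))
third = observed_window(at(16, 0, 50), timedelta(seconds=45), timedelta(seconds=20))

# every 5s for 10s group by time(10s)
other = observed_window(at(16, 0, 5), timedelta(seconds=10), timedelta(seconds=10))

for start, end in (first, third, other):
    print(start.time(), end.time())
```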

Ref

C-m, RET and Return Key in Emacs

I use Emacs to write my blog. After a recent update, I found that M-RET no longer behaves as the leader key in org mode; it behaves as org-meta-return instead. Even stranger, it still behaves as the leader key in other modes, and M-RET also works in org mode in the terminal. In the GUI, pressing C-M-m can trigger the leader key.

So I opened this issue, and with the help of these friends, it has been fixed. Here is the cause of the bug.

In Emacs, RET is not a key on the keyboard; it’s a logical key. Emacs binds RET to C-m in its source code. In the terminal, <Enter> and C-m both send the <CR> (ASCII 13) character, so the <Enter> / <Return> key is equal to RET. In the GUI, pressing the <Enter> / <Return> key actually sends <return> to Emacs, and Emacs automatically translates <return> to RET.

This can be verified: type SPC h d k <Enter> in Spacemacs, and it will output RET (translated from <return>) runs the command org-open-at-point, which is an interactive compiled Lisp function in ‘org.el’.

Pressing C-m or the <Enter> key usually gives the same result, but you can also bind them to two different commands. Take M-RET as an example. If only <M-return> is bound, M-RET is unbound. If only M-RET is bound, then <M-return> is implicitly bound to the same command as M-RET.

In the org mode source:

(org-defkey org-mode-map (kbd "M-<return>") #'org-meta-return)
(org-defkey org-mode-map (kbd "M-RET") #'org-meta-return)

These two keys are both bound to org-meta-return.

The unfixed Spacemacs configuration file binds C-M-m as dotspacemacs-major-mode-emacs-leader-key.

In the GUI, the <Enter> key sends <return> to Emacs. Org mode has explicitly bound M-<return> to org-meta-return, so org-meta-return is triggered. In other modes, the M-<return> key binding is not defined, so <return> is translated to RET, which then triggers the leader key.

In the fixed version, dotspacemacs-major-mode-emacs-leader-key is bound to M-<return> in the GUI, which overrides org mode’s binding. Finally, meta return becomes the leader key again.

Ref

Retrieve Large Dataset in Elasticsearch

It’s easy to get a small dataset from Elasticsearch by using size and from. However, it’s impossible to retrieve a large dataset in the same way.

Deep Paging Problem

As we know, Elasticsearch data is organised into indexes, which are logical namespaces, and the real data is stored in physical shards. Each shard is an instance of Lucene. There are two kinds of shards: primary shards and replica shards. Replica shards are copies of primary shards, kept in case nodes or shards fail. By distributing the documents in an index across multiple shards, and distributing those shards across multiple nodes, Elasticsearch can ensure redundancy and scalability. By default, Elasticsearch creates 5 primary shards and one replica shard for each primary shard.

How does Elasticsearch decide which shard a document goes to? By default, shard = hashCode(doc._id) % primary_shards_number. To keep this stable, the number of primary shards cannot be changed after the index has been created.
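A short sketch shows why the shard count has to stay fixed. It uses zlib.crc32 purely as a stand-in hash (an assumption for illustration, not the hash Elasticsearch uses), and route is a hypothetical helper: with the same ids, going from 5 to 6 primary shards sends most documents to a different shard, so existing documents could no longer be found where the formula points.

```python
import zlib


def route(doc_id: str, primary_shards: int) -> int:
    """Pick a shard as hash(_id) % number_of_primary_shards.

    zlib.crc32 is only a stand-in hash for this illustration.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % primary_shards


doc_ids = [f"doc-{i}" for i in range(1000)]

# Count the documents whose target shard changes when a sixth shard is added.
moved = sum(route(doc_id, 5) != route(doc_id, 6) for doc_id in doc_ids)
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```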

Usually, the shard size should be 20GB to 40GB. The number of shards a node can hold depends on its heap space. In general, 1GB of heap space can hold 20 shards.

Data is stored in different shards. If there are 5 shards, when running this query:

GET /_search?size=10

Each shard generates 10 search results and sends them to the coordinating node. The coordinating node sorts the 50 items and returns the first 10 results to the user. However, when the query becomes this:

GET /_search?size=10&from=10000

Although we only need 10 items, each shard has to return its first 10010 results to the coordinating node, and the coordinating node has to sort 50050 items. This search costs lots of resources.

As deep paging is costly, Elasticsearch restricts from + size to be no more than index.max_result_window, whose default value is 10000.
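The cost can be illustrated with a toy coordinating node. This is a simulation with in-memory lists standing in for shards and plain integers standing in for sort values, not how Elasticsearch is implemented; the search function is hypothetical. Each shard contributes its best from + size hits, and the coordinator merges them all before cutting out the requested page.

```python
import heapq
import random


def search(shards, size, from_=0):
    """Simulate a coordinating node; return (page, number_of_hits_merged)."""
    # Every shard must return its own top (from + size) hits.
    per_shard = [heapq.nsmallest(from_ + size, shard) for shard in shards]
    # The coordinator merges all of them, then keeps only one page.
    merged = list(heapq.merge(*per_shard))
    return merged[from_:from_ + size], len(merged)


random.seed(0)
values = random.sample(range(1_000_000), 100_000)
shards = [sorted(values[i::5]) for i in range(5)]  # 5 shards, 20,000 docs each

first_page, merged_first = search(shards, size=10)
deep_page, merged_deep = search(shards, size=10, from_=10000)

print(merged_first, merged_deep)
```

Both calls return 10 hits, but the second one merges 50050 candidates instead of 50.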

Scroll

The search method has to retrieve and sort the results over and over again, because it does not know how to continue the search from the previous position.

scroll is more efficient when retrieving a large set of data.

For example:

POST /twitter/_search?scroll=1m
{
    "size": 100,
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    }
}

The returned result contains a _scroll_id, which should be passed to the scroll API in order to retrieve the rest of the data.

POST /_search/scroll 
{
    "scroll" : "1m", 
    "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==" 
}

Scroll returns the results that matched at the time of the initial search request, like a snapshot, and ignores subsequent changes to the documents (index, update or delete). The scroll=1m parameter tells Elasticsearch how long it should keep the search context alive. If there are no follow-up requests using the returned scroll_id, the scroll context expires.

PS: In fact, when handling the initial search request, scroll caches the ids of all matched documents, then fetches the document content in batches of size for each following request.

Slice

It’s also possible to split the scroll in multiple slices and consume them independently.

GET /twitter/_search?scroll=1m
{
    "slice": {
        "id": 0, 
        "max": 2 
    },
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    }
}
GET /twitter/_search?scroll=1m
{
    "slice": {
        "id": 1,
        "max": 2
    },
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    }
}

The requests above split the scroll into 2 slices by using the max: 2 parameter. The union of the two requests’ results is equivalent to the result of a scroll query without slicing.

The slice of a document can be calculated by this formula: slice(doc) = hash(doc._id) % max_slice. This is quite similar to the calculation of shards mentioned before. For example, if the number of slices is 4 and the number of shards is 2, then slices 0 and 2 are assigned to the first shard and slices 1 and 3 are assigned to the second shard.
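The slicing formula guarantees that the slices are disjoint and that together they cover every matched document. A minimal sketch, again with zlib.crc32 as an assumed stand-in for the real hash and made-up document ids; slice_of is a hypothetical helper:

```python
import zlib


def slice_of(doc_id: str, max_slices: int) -> int:
    """slice(doc) = hash(doc._id) % max_slice, with crc32 as a stand-in hash."""
    return zlib.crc32(doc_id.encode("utf-8")) % max_slices


matched = [f"tweet-{i}" for i in range(20)]  # ids matched by the query

# What the two sliced scroll requests (id 0 and id 1, max 2) would return.
slices = [[doc for doc in matched if slice_of(doc, 2) == s] for s in range(2)]

print(len(slices[0]), len(slices[1]))
```

Because every id lands in exactly one slice, two workers can consume slice 0 and slice 1 independently without coordinating.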

When the number of slices is n, each matched document uses a bitset of n bits to remember which slice it belongs to. So you should limit the number of sliced queries you run in parallel to avoid a memory explosion.

Computing hash(doc._id) is expensive. You can also use another numeric doc_value field to do the slicing without the hash function. For instance:

GET /twitter/_search?scroll=1m
{
    "slice": {
        "field": "date",
        "id": 0,
        "max": 10
    },
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    }
}

Query performance is most efficient when the number of slices is equal to the number of shards in the index. If that number is large (e.g. 500), choose a lower number as too many slices will hurt performance. Setting slices higher than the number of shards generally does not improve efficiency and adds overhead.

from Picking the number of slices

Search After

Scroll is not suitable for real-time user requests. Elasticsearch 5 added the Search After API. It’s similar to scroll but provides a live cursor: it uses the results from the previous page to retrieve the next page.

To use search after, the query must be sorted, and each following query must also contain search_after set to the sort values of the last hit on the previous page.

For example, if the initial query is this:

GET twitter/_search
{
    "size": 10,
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    },
    "sort": [
        {"date": "asc"},
        {"tie_breaker_id": "asc"}      
    ]
}

Then you have to extract the sort values of the last document and pass them to search_after to get the next page of results.

GET twitter/_search
{
    "size": 10,
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    },
    "search_after": [1463538857, "654323"],
    "sort": [
        {"date": "asc"},
        {"tie_breaker_id": "asc"}
    ]
}
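The paging loop behind search_after can be sketched without a cluster. The code below simulates it on an in-memory list: docs, sort_key and search are made up for illustration and only mimic the behaviour described above (sort, then return the hits that come strictly after the cursor). Against a real cluster, search would be the HTTP request shown above.

```python
# 25 fake documents; several share a date, so the tie breaker matters.
docs = [
    {"date": 1463538850 + i // 3, "tie_breaker_id": f"{i:06d}"}
    for i in range(25)
]


def sort_key(doc):
    """Same order as the "sort" clause: date asc, then tie_breaker_id asc."""
    return (doc["date"], doc["tie_breaker_id"])


def search(size, search_after=None):
    """Return the first `size` hits that sort strictly after the cursor."""
    hits = sorted(docs, key=sort_key)
    if search_after is not None:
        hits = [doc for doc in hits if sort_key(doc) > tuple(search_after)]
    return hits[:size]


pages = []
cursor = None
while True:
    page = search(size=10, search_after=cursor)
    if not page:
        break
    pages.append(page)
    # The sort values of the last hit become the next search_after.
    cursor = list(sort_key(page[-1]))

print([len(page) for page in pages])
```

Without the tie breaker, documents sharing the same date as the last hit could be skipped or repeated, which is why the sort needs a unique field.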

Ref

Timezone in JVM

I wrote some Scala code to get the current time. However, the output is different on the development server and in docker.

import java.util.Calendar

println(Calendar.getInstance().getTime)

On my development server, it outputs Sun Oct 18 18:01:01 CST 2020, but in docker, it prints a UTC time.

I guessed it was related to the timezone setting and did some research. Here is the result.

How Does the JVM Detect the Timezone

All of the code can be found in this function: private static synchronized TimeZone setDefaultZone()

  String zoneID = AccessController.doPrivileged(new GetPropertyAction("user.timezone"));

  // if the time zone ID is not set (yet), perform the
  // platform to Java time zone ID mapping.
  if (zoneID == null || zoneID.isEmpty()) {
      String javaHome = AccessController.doPrivileged(
              new GetPropertyAction("java.home"));
      try {
          zoneID = getSystemTimeZoneID(javaHome);
          if (zoneID == null) {
              zoneID = GMT_ID;
          }
      } catch (NullPointerException e) {
          zoneID = GMT_ID;
      }
  }

First, it checks whether the JVM has the user.timezone property. If not, it calls the native method getSystemTimeZoneID, which is implemented in java.base/share/native/libjava/TimeZone.c, and the main logic is in java.base/unix/native/libjava/TimeZone_md.c.

In TimeZone_md.c, it looks for the timezone using the following steps and returns immediately once one is found.

Check the TZ environment variable.
Read /etc/timezone.
Read /etc/localtime. If it is a symbolic link (e.g. to /usr/share/zoneinfo/Asia/Shanghai), derive the timezone from the link path. Otherwise, compare its content with all files in /usr/share/zoneinfo and return the timezone whose file matches.
Return GMT as the timezone.
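The lookup order above can be mimicked in a few lines of Python, which is handy for checking what a container would resolve to. This is an approximation of the listed steps, not a port of TimeZone_md.c; detect_timezone and its env and root parameters are hypothetical, and the demo runs against a throwaway directory tree rather than the real filesystem.

```python
import os
import tempfile
from pathlib import Path


def detect_timezone(env, root=Path("/")):
    """Approximate the lookup order listed above; not the JVM's actual code."""
    # 1. The TZ environment variable.
    if env.get("TZ"):
        return env["TZ"]

    # 2. The content of /etc/timezone.
    etc_timezone = root / "etc/timezone"
    if etc_timezone.is_file():
        name = etc_timezone.read_text().strip()
        if name:
            return name

    # 3. /etc/localtime: use the link target, or compare file contents.
    localtime = root / "etc/localtime"
    zoneinfo = root / "usr/share/zoneinfo"
    if localtime.is_symlink():
        target = os.readlink(localtime)
        if "zoneinfo/" in target:
            return target.split("zoneinfo/", 1)[1]
    elif localtime.is_file() and zoneinfo.is_dir():
        content = localtime.read_bytes()
        for candidate in sorted(zoneinfo.rglob("*")):
            if candidate.is_file() and candidate.read_bytes() == content:
                return str(candidate.relative_to(zoneinfo))

    # 4. Fall back to GMT.
    return "GMT"


# Demo on a fake root directory with a placeholder zone file.
fake_root = Path(tempfile.mkdtemp())
(fake_root / "etc").mkdir()
zone_dir = fake_root / "usr/share/zoneinfo/Asia"
zone_dir.mkdir(parents=True)
(zone_dir / "Shanghai").write_bytes(b"placeholder tzdata")

fallback = detect_timezone({}, fake_root)            # nothing configured yet
os.symlink(zone_dir / "Shanghai", fake_root / "etc/localtime")
from_link = detect_timezone({}, fake_root)           # resolved from the link
from_env = detect_timezone({"TZ": "Europe/Berlin"}, fake_root)

print(fallback, from_link, from_env)
```

A bare docker image typically has neither TZ nor a zone-specific /etc/localtime, which matches the UTC output seen in the container.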

How to Change Timezone

The available timezone in Linux can be listed by this command: timedatectl list-timezones

Add JVM param

You can add -Duser.timezone=Asia/Shanghai as JVM parameters.

Set TZ environment variable

Add export TZ=Asia/Shanghai in .bashrc.

Change `/etc/timezone`

Set its content to Asia/Shanghai

Change `/etc/localtime`

Link it to /usr/share/zoneinfo/Asia/Shanghai

Change timezone manually in Java Program

All of these methods should work

Add this line before getting the time: TimeZone.setDefault(TimeZone.getTimeZone("Asia/Shanghai"))
Set the JVM property in code: System.setProperty("user.timezone", "Asia/Shanghai")
Set the timezone explicitly in Calendar: Calendar.getInstance(TimeZone.getTimeZone("Asia/Shanghai"))

Ref

Fix Error: Cask ‘java’ is unavailable in Homebrew

After updating brew to the latest version, cask-related commands such as brew list --cask always output Error: Cask 'java' is unavailable: No Cask with this name exists. However, the plain brew command still works.

After doing some research, I found that Java has been moved to homebrew/core. This makes sense now: I installed java through cask, but it’s no longer available there, so cask throws this error. If I uninstall java from cask, the error should disappear.

This is not easy, as cask is broken. Finally, I found this issue: brew cask upgrade fails with “No Cask with this name exists”. After running rm -rf "$(brew --prefix)/Caskroom/java", cask is back.

Improve Kafka throughput

Kafka is a high-performance and scalable messaging system. Sometimes, when handling big data, the default configuration may limit the maximum performance. In this article, I’ll explain how messages are generated and saved in Kafka, and how to improve performance by changing the configuration.

Kafka Internals

How does Producer Send Messages?

In short, messages are assembled into batches (named RecordBatch) and sent to the broker.

The producer manages some internal queues, and each queue contains the RecordBatch instances that will be sent to one broker. When the send method is called, the producer looks into the internal queue and tries to append the message to a RecordBatch that is smaller than batch.size (the default value is 16KB), or creates a new RecordBatch.

There is also a sender thread in the producer, which is responsible for turning RecordBatch instances into requests (<broker node, List(ProducerBatch)>) and sending them to the broker.
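The append-or-create logic can be sketched as a toy accumulator. This is a simplified model of the behaviour described above, not Kafka’s actual producer code: the Accumulator class, its methods and the destination names are made up, and sizes are measured as raw payload bytes only.

```python
from collections import defaultdict

BATCH_SIZE = 16 * 1024  # mirrors the batch.size default of 16KB


class Accumulator:
    """Toy per-destination queues of batches; not Kafka's real accumulator."""

    def __init__(self, batch_size=BATCH_SIZE):
        self.batch_size = batch_size
        self.queues = defaultdict(list)  # destination -> list of batches

    def append(self, destination, record: bytes):
        """Add a record to the newest batch if it fits, else start a new one."""
        queue = self.queues[destination]
        if queue and sum(map(len, queue[-1])) + len(record) <= self.batch_size:
            queue[-1].append(record)
        else:
            queue.append([record])

    def drain(self):
        """What a sender thread would do: take all batches, by destination."""
        ready, self.queues = dict(self.queues), defaultdict(list)
        return ready


acc = Accumulator(batch_size=100)  # tiny limit to make the batching visible
for _ in range(5):
    acc.append("broker-1", b"x" * 40)
acc.append("broker-2", b"y" * 40)

ready = acc.drain()
print({dest: [len(batch) for batch in batches] for dest, batches in ready.items()})
```

With a 100-byte limit and 40-byte records, only two records fit into one batch, so the five records for broker-1 end up in three batches.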

How are Records Saved?

The details can be found from these two articles: Apache Kafka - Message Format and A Guide To The Kafka Protocol - Apache Kafka - Apache Software Foundation.

Some important properties of RecordBatch are: batch_length, compression_type, CRC, timestamp and, of course, the List(Record).

Each Record consists of length, timestamp_delta, key(byte), value(byte) etc.

When you look into the Kafka topic data directory, you may find files like this:

00000000000000000000.log
00000000000000000000.index
00000000000000000000.timeindex
00000000000000000035.log
00000000000000000035.index
00000000000000000035.timeindex

Kafka saves each partition as segments. When a new record comes, it is appended to the active segment. If the segment’s size limit is reached, a new segment is created and becomes the active segment. Segments are named by the offset of their first record, so the segments’ names are incremental.

Furthermore, each segment is divided into three kinds of files: the log file, the index file and the timeindex file.

The log file contains the actual data.
The index file contains the record’s relative offset and its physical position in the log file. This lets the broker locate the record for a specific offset through the index instead of scanning the whole log file.
The timeindex file contains the record’s relative offset and its timestamp.
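Because segments are named after the offset of their first record, the segment that holds a given offset can be found with a binary search over the file names. The sketch below uses the two log files from the listing above; segment_for is a hypothetical helper, and this only illustrates the naming scheme, not the broker’s code.

```python
import bisect

segment_files = [
    "00000000000000000000.log",
    "00000000000000000035.log",
]

# (base offset, file name) pairs, ordered by base offset.
segments = sorted((int(name.split(".")[0]), name) for name in segment_files)
base_offsets = [base for base, _ in segments]


def segment_for(offset: int) -> str:
    """Return the log file whose base offset is the largest one <= offset."""
    position = bisect.bisect_right(base_offsets, offset) - 1
    if position < 0:
        raise ValueError(f"offset {offset} is before the first segment")
    return segments[position][1]


print(segment_for(34), segment_for(35))
```

Offsets 0 to 34 live in the first segment, and offset 35 starts the second one.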

How does Consumer pull messages?

The consumer keeps reading data from the broker and decompresses it if necessary. It puts the data into an internal queue and returns the target number of records to the client.

max.poll.records (the default value is 500) is the maximum number of records returned in a single call to poll().

fetch.min.bytes (the default value is 1) is the minimum amount of data the broker should return for a fetch request. If insufficient data is available, the server waits up to fetch.max.wait.ms milliseconds and accumulates data before answering the request.

How to Improve Performance

Increase Socket Buffer

The default socket buffer values in the Java client are too small for a high-throughput environment. receive.buffer.bytes (the default value is 64KB) and send.buffer.bytes (the default value is 128KB) are the SO_RCVBUF and SO_SNDBUF of the socket connections respectively. I recommend setting them to bigger values, or to -1 to use the OS default value.

batch.size, linger.ms and buffer.memory

As mentioned before, the producer always sends messages as RecordBatch. Each batch should be smaller than batch.size (the default value is 16KB). Increasing batch.size not only reduces the number of TCP requests to the broker, but also leads to a better compression ratio when compression is enabled.

linger.ms specifies the wait time before sending a RecordBatch, and it affects the real size of the RecordBatch indirectly. The producer groups together any records that arrive in between request transmissions into a single batched request. If the system load is low and the RecordBatch is not full, the producer’s sender still sends the batch once it has waited for linger.ms. The default value of linger.ms is 0, which means the producer sends messages as quickly as possible (but the messages that arrive between two send requests are still batched into a RecordBatch). Increasing this value brings the real batch size closer to batch.size and reduces the number of requests to be sent, but it also increases the latency of messages.

The buffer.memory (default value is 32MB) controls the total amount of memory available to the producer for buffering. If records are sent faster than they can be transmitted to the server then this buffer space will be exhausted. When the buffer space is exhausted additional send calls will block.

compression.type

As the throughput keeps growing, bandwidth may become the bottleneck. It’s easy to tackle this by adding the compression.type parameter to the producer. Once it is configured, the producer compresses the RecordBatch before sending it to the broker. If the records are text, the compression ratio should be high and bandwidth usage will decrease significantly.

There are two kinds of compression.type: topic level and producer level.

If you set compression.type in the producer, the producer compresses the records and sends them to the broker.

There is also a topic-level compression.type configuration. When it is set, the producer’s compression type is not constrained: the broker converts the data sent from the producer to the target compression.type. It can be set to gzip, snappy, lz4, zstd, uncompressed or producer. The default value is producer, which means the broker keeps the original data sent by the producer.

How to choose the compression type? According to Cloudflare’s test results in Squeezing the firehose: getting the most from Kafka compression:

type	CPU ratio	Compression ratio
None	1x	1x
Gzip	10.14x	3.58x
Snappy	1.61x	2.35x
LZ4	2.51x	1.81x

Gzip has the best compression ratio but takes lots of CPU time. Snappy keeps a balance between CPU time and space. The newer compression type zstd, added in Kafka 2.1, produces a larger compression ratio than Snappy at the cost of a little more CPU time.

These are common configurations. You can find more in the official document, such as max.in.flight.requests.per.connection.

Ref

Dynamically Allocate Executors when Executing Jobs in Spark

I wrote a Spark program to process logs. The number of logs changes over time. To ensure logs can be processed promptly, the number of executors is sized for the maximum number of logs per minute. As a consequence, CPU usage in the executors is low. To reduce the wasted resources, I tried to find a way to schedule executors while the program is running.

As shown below, the maximum number of logs per minute can be a dozen times greater than the minimum number in one day.

If I can adjust the number of executors according to the size of the data to process, resource usage should increase.

Dynamic Allocation

Spark provides a configuration to control the number of executors in this way. With spark.dynamicAllocation.enabled turned on, Spark changes the number of running executors automatically according to the number of tasks.

How does Dynamic Allocation Work?

Request Executors

As is well known, action operators (such as count and collect) create Spark jobs. Each job is divided into stages at shuffle boundaries, and each data partition in a stage becomes an independent task. When dynamic allocation is enabled, if there have been pending tasks for spark.dynamicAllocation.schedulerBacklogTimeout seconds, the driver requests more executors. If pending tasks still exist, the request is triggered again every spark.dynamicAllocation.sustainedSchedulerBacklogTimeout seconds. Furthermore, the number of executors requested in each round increases exponentially from the previous round. For instance, an application adds 1 executor in the first round, and then 2, 4, 8 and so on in the subsequent rounds. The total number of running executors never exceeds spark.dynamicAllocation.maxExecutors.
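The ramp-up described above can be illustrated with a small simulation. This is a simplified sketch, not Spark's implementation: the function name and parameters are hypothetical, and it assumes the backlog persists for every round.

```python
def executor_requests(max_executors, rounds, initial=0):
    """Simulate how many executors are requested in each backlog round.

    The request size doubles every round (1, 2, 4, ...) and the running
    total is capped at max_executors.
    Returns a list of (added_this_round, total_after_round) tuples.
    """
    total = initial
    to_add = 1
    schedule = []
    for _ in range(rounds):
        if total >= max_executors:
            break  # cap reached, nothing more to request
        added = min(to_add, max_executors - total)
        total += added
        schedule.append((added, total))
        to_add *= 2
    return schedule


schedule = executor_requests(max_executors=20, rounds=6)
for added, total in schedule:
    print(f"requested {added:2d}, total {total:2d}")
```

With a cap of 20, the last request is trimmed to 5 so that the total stops at the maximum.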

When it receives the first executor request, the driver asks the cluster manager to create an executor. After the new executor is created, the driver checks whether more requests are waiting and handles all of the pending requests.

The reason for this strategy is to avoid creating too many executors when the load peaks only briefly, while still making sure enough executors are created within a reasonable time if the load stays high.

Release Executors

After an executor has been idle for spark.dynamicAllocation.executorIdleTimeout seconds, it is released. Executors that hold cached data are not removed. Executors also hold shuffle data that later stages may need, so before Spark 3.0 an additional Spark service, the external shuffle service, is required to keep that data available after an executor is removed. From 3.0, the external shuffle service is not required if spark.dynamicAllocation.shuffleTracking.enabled is used.

Dynamic allocation is easy to use, but it has two disadvantages:

Slow scheduling. Executors are created serially. If two or more executors are requested, the driver asks the cluster manager to create executors at least twice. This is an issue if pod creation takes time. In general, that is fine, as the K8s 1.6 SLO is that 99% of pods should be created within 5s in a 5000-node cluster.
Hard to release executors if each task is short. Release is based on idle time. If there are many short tasks, an executor is unlikely to become idle, because tasks are assigned uniformly.

In our Spark program, the tasks are short and the data must be processed within 1 minute, so dynamic allocation is not suitable.

Manual Allocation

Luckily, Spark also provides a way to control the number of executors manually. We can use sc.requestExecutors and sc.killExecutors to create and delete executors.

To use these two functions, we have to know the number of running executors and their IDs.

Number of Running Executors

The Spark program's RAM usage can be obtained from sc.getExecutorMemoryStatus. It returns a map like this: [Map(10.73.3.67:59136 -> (2101975449,2101612350))]. The key is the IP with port, and the value is a tuple containing the max RAM and the available RAM. Please note that the driver is also included in the returned data, so the number of executors is the number of entries minus one.
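As a rough illustration of how that map can be interpreted, here is a sketch that works on a plain Python dict with the same shape. The function names, the extra host entries and the choice of which entry is the driver are all assumptions made for the example.

```python
def count_executors(memory_status, driver_host_port):
    """Count executors in a getExecutorMemoryStatus-like mapping.

    memory_status maps "host:port" to (max_memory, remaining_memory) in bytes.
    The driver's own entry is skipped.
    """
    return sum(1 for host_port in memory_status if host_port != driver_host_port)


def memory_used(memory_status):
    """Return bytes in use per entry: max memory minus remaining memory."""
    return {
        host_port: max_mem - remaining
        for host_port, (max_mem, remaining) in memory_status.items()
    }


# Hypothetical snapshot; the first entry reuses the sample values from the text.
status = {
    "10.73.3.67:59136": (2101975449, 2101612350),  # treated as the driver here
    "10.73.3.68:40001": (2101975449, 1900000000),
    "10.73.3.69:40002": (2101975449, 2000000000),
}

print(count_executors(status, "10.73.3.67:59136"))
print(memory_used(status)["10.73.3.67:59136"])
```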

IDs of Running Executors

The IDs are required when calling sc.killExecutors. They can be found through the Spark REST API: executor information such as ID, cores and tasks is recorded in /applications/[app-id]/executors.
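A minimal sketch of extracting the IDs from that endpoint's JSON response is shown below. It parses a hard-coded, trimmed sample instead of calling a live server, and it assumes the response is a list of objects with id and isActive fields and that the driver appears with the id "driver". Please verify this against your Spark version.

```python
import json


def active_executor_ids(executors_json):
    """Extract the IDs of active executors from the REST API response body.

    The driver entry (id "driver") is skipped, because it must not be
    passed to sc.killExecutors.
    """
    executors = json.loads(executors_json)
    return [
        e["id"]
        for e in executors
        if e["id"] != "driver" and e.get("isActive", True)
    ]


# Hypothetical, trimmed response; a real one carries many more fields.
sample = json.dumps([
    {"id": "driver", "isActive": True, "totalCores": 0},
    {"id": "1", "isActive": True, "totalCores": 4},
    {"id": "2", "isActive": False, "totalCores": 4},
    {"id": "3", "isActive": True, "totalCores": 4},
])

print(active_executor_ids(sample))
```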

With the help of sc.requestExecutors, we can create as many executors as we want in one request. But pod creation still takes too long. To reduce the number of pod creation requests, I used these strategies:

The running executors are expected to finish a job within 50s, in order to reserve some time for delayed tasks.
When the expected number of executors is close to the number currently running, no executors are requested or released.
If there is backlog data, request more executors.
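The three rules above could be combined into one sizing function along these lines. This is only a sketch of the idea, not the actual code of the program: the function name, the per-executor throughput parameter and the 10% tolerance are hypothetical values chosen for illustration.

```python
import math


def target_executors(pending_records, records_per_executor_per_sec,
                     current, budget_sec=50, tolerance=0.1):
    """Return how many executors to run for the next batch.

    pending_records includes any backlog, so a backlog raises the target.
    The current count is kept when the target is within `tolerance` of it,
    which avoids pod churn for small fluctuations.
    """
    capacity_per_executor = records_per_executor_per_sec * budget_sec
    needed = max(math.ceil(pending_records / capacity_per_executor), 1)
    if abs(needed - current) <= tolerance * current:
        return current
    return needed


print(target_executors(1_050_000, 1000, current=20))  # close enough: keep 20
print(target_executors(3_000_000, 1000, current=20))  # backlog: scale up
```

A positive difference between the target and the current count would then go to sc.requestExecutors, and a negative one to sc.killExecutors with the chosen executor IDs.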

Result

After switching to manual allocation, CPU usage grew a lot and reached 40%. The cores used by the Spark programs dropped from 1700 to 800. Furthermore, the Spark program now scales automatically.

Internet Account Keeps Coming Back after deletion on MacOS

Today I tried to delete an inactive Internet account in System Preferences. It was deleted successfully but came back after 20 seconds. This drove me nuts.

I tried these methods, but none of them worked.

Boot in safe mode, delete account.
Delete the record in the ZACCOUNT table in ~/Library/Accounts/Accounts4.sqlite.
Delete related items in Keychain Access app.

Later, RedHatDude's answer gave me a clue: it looks like an iCloud sync problem. I deleted the account on my 3 MacBooks at the same time. Thank goodness! It has not shown up again.

I managed to fix this issue today. Here is what I did:

Turn off keychain sync on all of my Apple devices (in the iCloud preferences). This included 2 Macs, an iPad and iPhone.

On each device I removed the accounts I no longer wanted

Then I turned keychain sync back on

This worked and all of my old/duplicate accounts are now gone.

– Gmail account keeps coming back after delete

Ref

QNAP TS-453Dmini Review

My first NAS was a Synology DS120j, an ARM-based entry-level product. It is fine for downloading and backup, but not powerful enough to run Docker and virtual machines.

So I bought this NAS last month, and I’m satisfied with it. Here are the advantages and disadvantages.

Advantages

High performance.
It is equipped with a J4125 quad-core 2.0 GHz processor, 8G RAM, two 2.5G ports and 4 bays. Here is the spec. Although the J4125 is not the fastest CPU in 2022 (the newer model comes with the N5105), it can still run several Docker containers together, and I can even run Synology and Windows 10 inside the built-in Virtualization Station.
Low Price.
I bought it for 2050 Yuan (about $320). Synology has a similar model, the DS920+, which is 70% more expensive.
Qtier
A special feature only available on QNAP. It automatically moves hot and cold data between SSD and HDD to get higher performance. One noticeable change: this NAS is so quiet that you can hardly hear the HDDs running.

Disadvantages

No native HEIC Support
This means you can't preview photos imported from an iPhone. You have to pay $12 for CAYIN MediaSign Player to fix this.
Slow Start and Shutdown
I know a NAS rarely needs a restart, but taking several minutes to restart is not acceptable. I hope QNAP engineers can improve this in the future.

Docker Images Recommendation

Here are some Docker images I find useful:

johngong/qbittorrent

A torrent client which can also download new files through RSS feeds.

p3terx/aria2-pro

Everyone knows what aria2 is.

dreamacro/clash

A rule based proxy client.

jeessy/ddns-go

A simple DDNS client.

linuxserver/plex

Plex is an app which helps you manage and browse your media library. It can grab the metadata of TV shows, movies and music from the Internet. It provides apps on different platforms, so you can access your media from anywhere.

There is only one thing I think needs to improve: the playback speed is fixed, which is not convenient when watching animation.

You can also give Emby or Jellyfin a shot.

portainer/portainer-ce

QNAP has a built-in docker-compose command. You can use this web app if you prefer a GUI.

vaultwarden/server

I deploy this on my VPS instead of the NAS, as port 443 is forbidden on the NAS. It's an alternative to 1Password. Although the app UI is not perfect, it has all the functions a password manager requires. There is no official way to back up the data, so I use crontab to run a backup script that saves the data to Google Drive with Rclone.

The backup script is quite simple. You need to link your Google Drive account as google in Rclone before using it.

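#!/bin/sh
# Back up the vaultwarden data directory to Google Drive with Rclone.
pwd
echo backing up
# -f: do not fail when there is no previous archive
rm -f bit.zip
# Quote the exclude patterns so that zip, not the shell, expands them
zip -rq bit.zip ./data -x './data/icon_cache/*' './data/bitwarden.log'
# "google" is the Rclone remote name configured beforehand
rclone copy bit.zip google:/应用/vaultwarden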

Emacs Chinese-related Settings

Auto Switch Input Method in Evil

This setting switches the input method based on the context around the cursor when entering insert mode.

sis

I'm using the sis package with this configuration. You may need to install macism if you're not using railwaycat/emacsmacport. More settings can be found in emacs-smart-input-source.

(sis-ism-lazyman-config
  "com.apple.keylayout.US"
  "com.apple.inputmethod.SCIM.ITABC")

(sis-global-cursor-color-mode t)
(sis-global-respect-mode t)
(sis-global-context-mode t)
(sis-global-inline-mode t)

fcitx

You can also install fcitx-remote-for-osx and use cute-jumper/fcitx.el to do so. As Homebrew no longer supports some build options, you need to follow the install instructions in the GitHub repository to build fcitx.

Mono Chinese Font

I use a 14pt English font and a 16pt Chinese font, so that one Chinese character is the same width as two English characters. It can be set by adding this to the Emacs configuration file.

dotspacemacs-default-font '("Menlo"
                            :size 14.0
                            :weight normal
                            :width normal)

;; add into dotspacemacs/user-config()
(dolist (charset '(kana han symbol cjk-misc bopomofo))
  (set-fontset-font (frame-parameter nil 'font)
                    charset (font-spec :family "PingFang SC"
                                        :size 16)))

If you enable the chinese layer in Spacemacs, it provides a more convenient function:

(spacemacs//set-monospaced-font "Menlo" "PingFang SC" 14 16)

PS: valign provides visual alignment for Org Mode and Markdown without changing fonts.

Ref

How to copy files temporarily in Dockerfile

It's very common to copy a local file into the container when building a Docker image. In general, we use the COPY command. But it creates a new layer and increases the final image size. If this is a temporary file and we don't want users to waste their storage space, how can we remove it? Here are some approaches.

Download the File Dynamically

If the file can be downloaded from a URL, or you can create a local HTTP server to share the file, you can download the file, use it and delete it in one RUN command. For example:

RUN wget xxxx && unzip xxx && rm xxx

`RUN --mount` Command

You can also mount the file when building the image if it can't be downloaded from the Internet or it is secret. Use RUN --mount to bind files or directories to the build container.

A bind mount is read-only by default; add the rw parameter to make it writable. Changes made during the build are discarded after the build is complete.

The mounted folder is kept in the image, but the files are gone. Don't forget to delete the empty folder if you want to keep the image clean.

Mount folder

RUN --mount=type=bind,target=/target_dir/,source=./source_dir/,rw

Mount file

RUN --mount=type=bind,target=/azure-cli.rpm,source=./docker/azure-cli.rpm tdnf install ca-certificates /azure-cli.rpm -y && tdnf clean all

`--squash` option in `docker build`

You can also use --squash to reduce the image size. Once the build is complete, Docker creates a new image, loading the diffs from each layer into a single new layer that references all the parent's layers. So the extra space taken by the COPY command can be freed by squashing.

Ref

Line Ending in Git

When working on a project with multiple developers, line endings can be troublesome. This article explains how to configure line endings in Git.

Basic configuration

The line ending on Windows is CRLF, and on Linux it is LF. To prevent line ending issues, we can set core.autocrlf to true on Windows to let Git convert CRLF to LF on commit and convert LF to CRLF on checkout. It is configured automatically if you install Git on Windows.

Configuring Git to handle line endings - GitHub Docs

# Configure Git to ensure line endings in files you checkout are correct for Windows.
# For compatibility, line endings are converted to Unix style when you commit files.
$ git config --global core.autocrlf true

Advanced configuration

You can also use .gitattributes to control the line ending in each repository. The .gitattributes file is a text file that tells Git how to handle files in the repository. You can specify the line ending of each file type in this file.

Auto convert line ending

With * text=auto, Git handles the files in whatever way it thinks is best. This is a good default option.

Use *.c text to explicitly declare a file as a text file, so this file is always normalized and converted to native line endings on checkout.

Use *.png binary to explicitly declare a file as binary, so Git does not convert it. (binary is an alias for -text -diff)

Force conversion on checkout:

You can use eol to force conversion on checkout. The following config forces bat files to be converted to CRLF on checkout, even on Mac and Linux.

* text=auto
*.bat eol=crlf

This is the result of git ls-files --eol on Windows and Linux:

git ls-files --eol src/azure-cli/az.bat
i/lf    w/crlf  attr/text=auto eol=crlf src/azure-cli/az.bat

i means the index, w means the working tree, attr means the attribute used when checking out or committing.

You can set eol to crlf or lf. If it's not specified, the line ending is determined by core.autocrlf or core.eol. If text is set but neither of those variables is set, the default value is crlf on Windows and lf on Linux and Mac.
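The precedence described above can be written down as a small decision function. This is a simplified model for files that Git treats as text, not Git's own code, and the function and parameter names are hypothetical.

```python
def working_tree_eol(eol_attr=None, autocrlf="false",
                     core_eol="native", platform="linux"):
    """Return "crlf" or "lf" for a text file on checkout.

    Precedence modelled here: the eol attribute, then core.autocrlf,
    then core.eol, then the platform default.
    """
    if eol_attr in ("crlf", "lf"):
        return eol_attr          # *.bat eol=crlf wins everywhere
    if autocrlf == "true":
        return "crlf"
    if autocrlf == "input":
        return "lf"
    if core_eol in ("crlf", "lf"):
        return core_eol
    # Nothing configured: fall back to the platform's native ending.
    return "crlf" if platform == "windows" else "lf"


print(working_tree_eol(eol_attr="crlf", platform="linux"))
print(working_tree_eol(platform="windows"))
print(working_tree_eol(platform="linux"))
```

The first call corresponds to the az.bat example: the eol attribute forces crlf in the working tree even on Linux.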

Refresh setting

If you change the .gitattributes file, you need to run the following commands to refresh the working tree.

# Please commit the .gitattributes changes before running these commands.
git rm -rf --cached .
git reset --hard HEAD

Extra

Line endings in a tarball also follow .gitattributes. The result is identical to a Git checkout on a Linux machine.
The .gitattributes settings only affect new commits. If you want to change the line endings of files that are already in the Git index after changing the line ending settings, you can use git add --renormalize . to force Git to refresh all tracked files. For example, if a bat file has been added as crlf in the Git index and you then set it as text in .gitattributes, running this command asks Git to change it to lf in the index.

Ref

Improve Git speed in WSL

File system performance is poor when WSL2 accesses files on the Windows host, so it takes a long time to run git status in a repo stored on the host. Moreover, if you use a fancy shell prompt, it takes a long time to show the prompt. This article introduces how to speed up Git in WSL2.

How to speed up Git Command

The solution is to use the Windows git.exe for repos on the host. You can add this to your bashrc:

# Use the Windows git.exe for repos under /mnt/c, and the Linux git elsewhere.
git() {
  case "$(pwd -P)" in
    /mnt/c|/mnt/c/*) git.exe "$@" ;;
    *) command git "$@" ;;
  esac
}

How to speed up Shell Prompt

If you have configured a fancy shell prompt, powerlevel10k for example, it automatically gets the git status when you enter a git repo, so the prompt takes a long time to show inside a repo on the host. You can speed it up in two ways. The first is to disable git status in the prompt: edit the .p10k.zsh file and comment out the vcs prompt element. The prompt then no longer gets the git status when you enter a git repo. However, you also can't see the git status when you are in a repo on the WSL file system.

The second way is to disable the dirty-state and untracked-file checks. You can run these commands to disable them:

# stop checking for unstaged and staged changes
git config bash.showdirtystate false
# stop checking for untracked files
git config bash.showuntrackedfiles false

In this way, you can still see other git status such as branch name and staged files with an instant response.

Ref

iPod Video Review

I bought an iPod Video 5.5th Gen 80G recently. It cost only 200 Yuan (about $30) and I'm satisfied with it.

Rockbox

The original firmware supports few audio formats; it can't even play FLAC. I installed Rockbox on it, which supports FLAC and other formats, and lets me transfer music without using iTunes or Finder. It also supports themes and plugins, which makes it more powerful.

MacPod error

If you restore the iPod on macOS, it raises Warning: This is a MacPod, Rockbox only runs on WinPods. See http://www.rockbox.org/wiki/IpodConversionToFAT32 during installation. The easiest way to fix this is to restore it on Windows.

Permission denied error

When I tried to install rockbox 1.5.1 on macOS, it raised could not open ipod permission denied when I clicked install. Using sudo /Applications/RockboxUtility.app/Contents/MacOS/RockboxUtility can fix this issue. Some said using 1.4.1 on other OS can fix this issue, but I haven’t tried it.

Hangs on `Waiting for system to remount player`

My iPod hung on Waiting for system to remount player when I installed Rockbox. After the timeout, I disconnected the iPod and restarted it. The startup screen showed Can't load rockbox.ipod: file not found. I connected the iPod to the computer and used Rockbox Utility to install Rockbox again. This time I unchecked the bootloader and only installed Rockbox, fonts and Plugin Data. The error was gone.

Theme

There are many themes in rockbox. I prefer fresh os light and adwaitapod Simplified. They also provide the dark version.

Custom font

See https://d00k.net/wiki/rockbox_advanced/font_combining/

Replace SSD

The original HDD is small, slow and fragile compared with an SSD, so you can replace it with an SSD.

Different Adapters

CE to m2 adaptor (chip: JMB20330) and a 2242 m.2 SATA SSD (Recommended)

CE to TF card adaptor (the adaptor is expensive but gives longer battery life)

CE/ZIF SSD (discontinued)

SSD Size

Not every iPod OS supports a large SSD. The default OS has a maximum track limit and an SSD size limit. The track limit stems from the RAM size: the large-capacity models come with more RAM and a higher track threshold. For the iPod Classic 6th and 6.5th Gen, if you install an SSD larger than 128G, iTunes only recognizes 128G. This is due to the LBA28 limitation. Both limitations can be eliminated by Rockbox. If you want to stay on the original OS, I recommend buying a 5.5th Gen 80GB or a 7th Gen.

Model Description	Model No.	iTunes Storage Limit (see note below)
5th Gen 30GB	MA002 / MA146 / PA002 / PA146	~20000 Tracks
5th Gen 60GB	MA003 / MA147 / PA003 / PA147	~50000 Tracks
5.5th Gen 30GB	MA444 / MA446 / PA444 / PA446 / MA664	~20000 Tracks
5.5th Gen 80GB	MA448 / MA450 / PA448 / PA450	~50000 Tracks
6th Gen 80GB	MB029 / MB147 / PB029	128GB / ~50000 Tracks
6th Gen 160GB	MB145 / MB150 / PB145 / PB150	128GB / ~50000 Tracks / Requires Ribbon
6.5th Gen 120GB	MB565 / PB565 / MB562 / PB562	128GB / ~50000 Tracks
7th Gen 160GB	PC297 / MC297 / PC293 / MC293	~50000 Tracks

Source: https://www.iflash.xyz/store/iflash-compatibility/
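The 128G figure follows directly from the LBA28 addressing scheme. A quick check of the arithmetic, assuming the classic 512-byte sector size:

```python
SECTOR_BYTES = 512        # classic ATA sector size (assumed here)
LBA28_SECTORS = 2 ** 28   # 28-bit addresses can name this many sectors

limit_bytes = LBA28_SECTORS * SECTOR_BYTES
limit_gib = limit_bytes / 2 ** 30
print(f"LBA28 limit: {limit_bytes} bytes = {limit_gib:.0f} GiB")
```

Anything beyond that size cannot be addressed with 28-bit sector numbers, which is why the extra capacity is invisible to the original OS.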

Guide

iPod 5th Generation (Video) Hard Drive Replacement

Replace battery

The iPod comes with a 580 mAh or 850 mAh battery, depending on whether it has the thin or the thick back cover.

After replacing the HDD with an SSD, there is more space for the battery. You can fit a larger battery to get longer battery life. Here is the guide: https://www.ifixit.com/Guide/iPod+Classic+Battery+Replacement/561.

Some sellers even offer a 3000 mAh battery. I'm not sure whether your iPod has enough space for it, so please ask the seller before buying it. Some said the 3000 mAh battery is fake and is actually a 1800 mAh battery. Source: 3rd party extended Battery guide

Guide

iPod 5th Generation (Video) Hard Drive Replacement

Ref

Run Synology in QNAP NAS with PVE

Three years ago, I bought a QNAP TS-453Dmini NAS. Although it has a slow web UI and restarts slowly, it still fits my needs, as all of the applications I need run in Docker.

Recently, I wanted to move some files from my Mac to the NAS to save space. I needed an application that behaves like Dropbox, which can show all the files on the NAS and only download the files I need. I tried Qsync, but it has no thumbnails for cloud images and no icons to show the file status. I also tried Seafile. It's a powerful application, but it requires 4G RAM to run, and there is a bug in the thumbnails. I used to have a Synology ARM NAS, and Synology Drive has all the features I need, so I wanted to run it on my QNAP NAS. After some research, I managed to run Synology and QNAP together on my NAS. Here is the guide.

Install PVE on QNAP

To run Synology on QNAP, the first step is to install PVE. I put the PVE ISO on a Ventoy USB drive and rebooted the NAS. When you hear the beep, press the DEL key immediately to enter the BIOS. Change the boot order to boot from the Ventoy drive and then install PVE on the NAS. Remember to move the PVE disk to HDD bay 3 or 4, because the QNAP BIOS won't use bay 1 or 2 as the boot disk.

Install Synology on PVE

It's pretty easy to install Synology on PVE, and there are many guides on the Internet. Thanks to the RR project, you can install Synology on any x86 hardware.

Install QNAP on PVE

I have one SSD and one HDD. The SSD is used for PVE and Synology, but I don't have another disk to move the data to Synology. So I still need to run QNAP on PVE. Although QNAP is not as popular as Synology, there is still a way to run it on PVE. I just followed this guide to install QNAP on PVE. To pass through the HDD, run ls -l /dev/disk/by-id/ to list all devices and qm set {qnap vm id} -sata0 /dev/disk/by-id/{hdd_id} to pass the HDD through to the QNAP VM. When the QNAP VM restarts, it asks whether to load the OS from the HDD, reset the OS (but keep the data) or initialize the OS.

However, when I chose to load the OS, it rebooted and returned to the same screen. I guess it might be a compatibility issue. Resetting the OS may work. In my case, I created a new disk for the QNAP VM and installed the QNAP OS on it, then passed the HDD through to the QNAP VM. Then, in the Storage and Snapshot app, under Storage > Disks/VJBOD, I clicked the More button in the right corner and chose Recover. I managed to recover the data on the HDD. After that, I could use QNAP as usual.

Synology OS vs QNAP OS

After using Synology for a while, I found the Synology OS more user-friendly than the QNAP OS. The UI is more responsive and the applications are more polished. There are more community applications available for Synology, and the overall experience feels more integrated. The Synology OS is also better optimized. For example, the Synology backend services use much less CPU and RAM than QNAP's. I will migrate all of my data to Synology once I get a new HDD.

Ref:

Kindle Paperwhite 5 Review

I bought a Kindle Paperwhite 1 in 2013, when I was still at university. I like it very much and I've read many programming books on it. It still works fine after 10 years, but I wanted to try a new model with a larger screen and a faster refresh speed. So I recently bought a used Kindle Paperwhite 5 Signature Edition for only 620 Yuan (about $90). Here is my review.

The Kindle Paperwhite 5 has a 6.8 inch screen with 300 PPI; both the screen size and the PPI are higher than on the previous model. The Kindle Paperwhite 6 has a larger 7 inch screen and double the RAM (1G) of the Paperwhite 5 (512M). You can buy a 16G model for 850 Yuan (about $120). I think the overall experience is good enough for reading books, so I chose the KPW5.

The Signature Edition has 32G of storage, wireless charging and an auto-adjusting light sensor. I haven't used the wireless charging. I don't think it's necessary, as the battery life is so long that I hardly need to charge it. The auto-adjusting light is not noticeable either. As for storage, I think 8G is big enough if you don't read manga.

The biggest difference between the KPW5 and the KPW1 is response speed. The KPW5 is much faster when opening books or tapping UI buttons. The page turning speed is also significantly improved.

Overall, I'm satisfied with the new model. If you want to buy a Kindle, I recommend buying the KPW5 or KPW6, as the larger screen is worth it.

Kindle Jailbreak

If you want to read PDF on Kindle, KOReader is a must-have app. You can follow the Kindle Modding guide to jailbreak your Kindle and install KOReader.

There are some plugins I recommend:

Kindle FileBrowser: You can transfer books through wifi without using USB cable.
KOL: Open KOReader from main UI.
Cover Setter: Set a better cover for KUAL/KOL/HOTFIX

Kindle software

Besides the must-have Calibre, there are two tools I recommend for Kindle users:

kaf-cli: A CLI tool to convert txt to epub, with an O'RLY-style cover.
Kindle Comic Converter: A tool to convert manga to mobi, which can be read on Kindle with full screen.

	Colab	Kaggle Kernel
GPU	Tesla T4(16G)	Tesla P100(16G)
RAM	13G	13G
Max training time	12h	9h
Export trained model	Google Drive	-

