As I said before, I’m working on a text classification project. I use doc2vec to convert text into vectors, then I use LPA to classify the vectors.
LPA (Label Propagation Algorithm) is a simple, effective semi-supervised algorithm. It can use the density of unlabeled data to find a hyperplane to split the data.
Here are the main steps of the algorithm:
- Let $(x_1,y_1)…(x_l,y_l)$ be labeled data, where $Y_L = \{y_1…y_l\}$ are the class labels. Let $(x_{l+1},y_{l+1})…(x_{l+u},y_{l+u})$ be unlabeled data, where $Y_U = \{y_{l+1}…y_{l+u}\}$ are unobserved; usually $l \ll u$. Let $X=\{x_1…x_{l+u}\}$ where $x_i ∈ R^D$. The problem is to estimate $Y_U$ from $X$ and $Y_L$.
- Calculate the similarity of the data points. The simplest metric is Euclidean distance. Use a parameter $σ$ to control the weights.
  Larger weights allow labels to travel through more easily.
- Define a $(l+u) \times (l+u)$ probabilistic transition matrix $T$.
- Propagate $Y ← TY$.
- Row-normalize $Y$.
- Reset the labeled data’s $Y$. Repeat from the propagation step until $Y$ converges.
In short: give nearer points larger weights, recompute each point’s label from its neighbours, reset the labeled points’ labels, and repeat.
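The steps above can be sketched in plain Python. This is only a toy illustration: the weight formula $w_{ij} = \exp(-d_{ij}^2/σ^2)$ and the column-normalised transition matrix $T_{ij} = w_{ij}/\sum_k w_{kj}$ are assumptions on my side (the usual choices for label propagation), and the function name `label_propagation` is made up for this example.

```python
import math


def label_propagation(points, labels, n_classes, sigma=1.0, max_iter=1000, tol=1e-6):
    """Toy label propagation. labels[i] is a class index, or None if unlabeled."""
    n = len(points)
    # Assumed weights: w_ij = exp(-d_ij^2 / sigma^2), d_ij = Euclidean distance.
    w = [[math.exp(-sum((a - b) ** 2 for a, b in zip(points[i], points[j])) / sigma ** 2)
          for j in range(n)] for i in range(n)]
    # Assumed transition matrix: T_ij = w_ij / sum_k w_kj.
    col_sums = [sum(w[k][j] for k in range(n)) for j in range(n)]
    t = [[w[i][j] / col_sums[j] for j in range(n)] for i in range(n)]

    def clamp(y):
        # Reset the rows of labeled points to their known one-hot labels.
        for i, lab in enumerate(labels):
            if lab is not None:
                y[i] = [1.0 if c == lab else 0.0 for c in range(n_classes)]
        return y

    # Unlabeled points start from a uniform distribution.
    y = clamp([[1.0 / n_classes] * n_classes for _ in range(n)])
    for _ in range(max_iter):
        # Propagate: Y <- TY
        new_y = [[sum(t[i][k] * y[k][c] for k in range(n)) for c in range(n_classes)]
                 for i in range(n)]
        # Row-normalize, then reset the labeled rows.
        new_y = clamp([[v / sum(row) for v in row] for row in new_y])
        delta = max(abs(a - b) for ra, rb in zip(new_y, y) for a, b in zip(ra, rb))
        y = new_y
        if delta < tol:
            break
    # Return the most probable class for every point.
    return [max(range(n_classes), key=lambda c: row[c]) for row in y]


points = [(0, 0), (0.1, 0), (0.2, 0.1), (5, 5), (5.1, 5), (4.9, 5.2)]
labels = [0, None, None, 1, None, None]
predicted = label_propagation(points, labels, n_classes=2)
```

With one labeled point per cluster, the unlabeled points pick up the label of the cluster they sit in.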
To avoid the vanishing and exploding gradient problems of the vanilla RNN, LSTM was proposed; it can remember information for longer periods of time.
Here is the structure of LSTM:
The calculation procedure is:
Use
Compared with LSTM, GRU merges the cell state and hidden state into one hidden state, and uses
Generally, word2vec is a language model that predicts a word’s probability based on its context. While building the model, it creates a word embedding for each word, and these embeddings are widely used in many NLP tasks.
Use the context to predict the probability of current word. (In the picture, the word is encoded with one-hot encoding,
- Context words’ vectors are $υ_{c-m} … υ_{c+m}$ ($m$ is the window size)
- Context vector $\hat{υ}=\frac{υ_{c-m}+υ_{c-m+1}+…+υ_{c+m}}{2m}$
- Score vector $z_i = u_i\hat{υ}$, where $u_i$ is the output vector representation of word $ω_i$
- Turn scores into probabilities $\hat{y}=softmax(z)$
- We desire probabilities $\hat{y}$ to match the true probabilities $y$.
We use cross entropy
For perfect prediction,
According to this, we can create this loss function:
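The CBOW steps above can be sketched in plain Python. I assume the standard cross-entropy here: because the true distribution $y$ is one-hot, it reduces to $-\log \hat{y}_c$ for the true word $c$. The function name `cbow_forward` and the tiny vectors are made up for illustration.

```python
import math


def softmax(z):
    # Subtract the max for numerical stability.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cbow_forward(context_ids, target_id, input_vectors, output_vectors):
    dim = len(input_vectors[0])
    # Context vector: average of the context words' input vectors.
    v_hat = [sum(input_vectors[i][d] for i in context_ids) / len(context_ids)
             for d in range(dim)]
    # Score for every word in the vocabulary: z_i = u_i . v_hat
    z = [sum(u[d] * v_hat[d] for d in range(dim)) for u in output_vectors]
    y_hat = softmax(z)
    # Cross entropy against a one-hot target (assumed standard definition).
    loss = -math.log(y_hat[target_id])
    return y_hat, loss


# Vocabulary of 3 words with 2-dimensional vectors.
input_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output_vectors = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
y_hat, loss = cbow_forward([0, 1], 0, input_vectors, output_vectors)
```

A perfect prediction would give `y_hat[target_id] == 1` and therefore a loss of 0.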
Use current word to predict its context.
- We get the input word’s vector $υ_c$
- Generate $2m$ score vectors, $u_{c-m},…,u_{c-1},…,u_{c+m}$.
- Turn scores into probabilities $\hat{y}=softmax(u)$
- We desire probabilities $\hat{y}$ to match the true probabilities $y$.
Minimize
Encode words into a huffman tree, then each word has a Huffman code. The probability of it’s probability
Then the probability of negative classification is
足球’s Huffman code is
where
The probability of 足球 (“football”) is the product of these equations.
Generally,
This reduces the calculation complexity to
This method chooses some negative samples and adds the probability of the negative words to the loss function. The optimisation target becomes maximising the positive words’ probability and minimising the negative words’ probability.
Let
where
—
- update 04-04-20
I found these two articles pretty useful: Language Models, Word2Vec, and Efficient Softmax Approximations and Word2vec from Scratch with NumPy.
- [word2vec 原理推导与代码分析](http://www.hankcs.com/nlp/word2vec.html)
- [CS 224D: Deep Learning for NLP Lecture Notes: Part I](http://cs224d.stanford.edu/lecture_notes/notes1.pdf)
- [word2vec 中的数学原理详解(一)目录和前言](http://blog.csdn.net/itplus/article/details/37969519)
Here are some parameters of gensim’s doc2vec class.
window is the maximum distance between the predicted word and the context words used for prediction within a document. It looks both behind and ahead.
In the skip-gram model, if the window size is 2, the training samples will be as follows (the blue word is the input word):
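As a rough sketch of how such training samples can be generated, the helper below (a made-up name, `skipgram_pairs`) pairs each input word with every word at most `window` positions away. It is a simplification and does not try to mirror gensim’s internal sampling.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input word, context word) pairs for a tokenized sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs("the quick brown fox".split(), window=2)
```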
If a word appears fewer times than this value, it will be skipped.
High-frequency words like the are useless for training. sample is a threshold for deleting these higher-frequency words. The probability of keeping the word
where
This is the plot when sample is 1e-3.
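To get a feeling for the numbers, here is a small sketch of the keep probability. I am assuming the formula from the word2vec C code as described in the tutorial linked in this post, $P(w_i) = (\sqrt{z(w_i)/s} + 1) \cdot s / z(w_i)$, where $z(w_i)$ is the fraction of the corpus made up of word $w_i$ and $s$ is sample. The function name is made up.

```python
import math


def keep_probability(word_fraction, sample=1e-3):
    """Probability of keeping a word that makes up `word_fraction` of the corpus."""
    # Assumed formula: (sqrt(z / sample) + 1) * sample / z, capped at 1.
    p = (math.sqrt(word_fraction / sample) + 1) * sample / word_fraction
    return min(1.0, p)


rare = keep_probability(0.0001)     # rare words are always kept
medium = keep_probability(0.00746)  # roughly a 50% chance of being kept
frequent = keep_probability(1.0)    # an extreme case: the corpus is one repeated word
```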
Usually, when training a neural network, all of the weights need to be tweaked for each training sample. For example, if the word pair is (‘fox’, ‘quick’), then only the neuron for quick should output 1, and all of the other word neurons should output 0.
But this would take a lot of time when we have billions of training samples. So, instead of updating all of the weights, we randomly choose a small number of “negative” words (the default value is 5) and update only their weights (so that they output 0).
So when dealing with the word pair (‘fox’, ‘quick’), we update quick’s weights to output 1, and the weights of 5 other random words to output 0.
The probability of selecting word
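A minimal sketch of such a sampling distribution is shown below. It assumes the commonly described choice of raising each word count to the power 3/4 before normalising, which gives rarer words a slightly larger chance than their raw frequency would. The function name and the counts are made up.

```python
def negative_sampling_probs(counts, power=0.75):
    """Map each word to its probability of being drawn as a negative sample."""
    # Assumption: unigram counts raised to the 3/4 power, then normalised.
    weights = {word: count ** power for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


probs = negative_sampling_probs({"the": 16, "fox": 1})
```

With raw frequencies, fox would be drawn with probability 1/17; with the 3/4 power it becomes 1/9.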
- [Word2Vec Tutorial - The Skip-Gram Model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
- [Word2Vec Tutorial Part 2 - Negative Sampling](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/)
Here is a simple way to classify text without much human effort and still get impressive performance.
It can be divided into two steps:
- Get training data by using keyword classification
- Generate a more accurate classification model by using doc2vec and label spreading
Keyword-based classification is a simple but effective method, but extracting the target keywords is monotonous work. I use this method to extract keyword candidates automatically.
- Find some of the most common words to classify the text.
- Use this equation to calculate the score of each word that appears in the text.
  $$ score(i) = \frac{count(i)}{all\_count(i)^{0.3}}$$
  where $all\_count(i)$ is word $i$’s count in the whole corpus, and $count(i)$ is word $i$’s count in the positive corpus.
- Check the top words and add them to the final keyword list. Repeat this process.
Finally, we can use the keywords to classify the text and get the training data.
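The scoring step can be sketched like this. The function name `keyword_scores` and the toy documents are made up; the positive documents are assumed to be a subset of the whole corpus, so every scored word has a non-zero count in the whole corpus.

```python
from collections import Counter


def keyword_scores(positive_docs, all_docs, power=0.3):
    """score(i) = count(i) / all_count(i) ** 0.3 for every word in the positive corpus."""
    count = Counter(word for doc in positive_docs for word in doc)
    all_count = Counter(word for doc in all_docs for word in doc)
    return {word: c / all_count[word] ** power for word, c in count.items()}


positive_docs = [["goal", "match"], ["goal"]]
all_docs = positive_docs + [["match", "the"], ["the", "the"]]
scores = keyword_scores(positive_docs, all_docs)
# Candidates sorted from the most to the least promising keyword.
candidates = sorted(scores, key=scores.get, reverse=True)
```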
Keyword-based classification sometimes produces the wrong result, as it can’t use the semantic information in the text. Fortunately, Google has open-sourced word2vec, which can be used to produce semantically meaningful word embeddings. Furthermore, sentences can also be converted to vectors by using doc2vec. Sentences with close meanings also have a short vector distance.
So the problem is how to classify these vectors.
- Use the corpus to train the doc2vec model.
- Use the doc2vec model to convert sentences into vectors.
- Use the label spreading algorithm to train a classification model to classify the vectors.
PyTorch is a really powerful framework for building machine learning models. Although some features are missing when compared with TensorFlow (for example, early stopping and History for drawing plots), its code style is more intuitive.
Torchtext is an NLP package which is also made by the PyTorch team. It provides a way to read, process and iterate over text.
Google Colab is a Jupyter notebook environment hosted by Google; you can use a free GPU or TPU to run your model.
Here is a simple tutorial on building a TextCNN model and running it on Colab.
The TextCNN paper was published by Kim in 2014. The model’s idea is pretty simple, but the performance is impressive. If you are trying to solve a text classification problem, this model is a good choice to start with.
The main architecture is shown below:
It uses different kernels to extract text features, then uses softmax regression to classify the text based on the features.
Now we can build this model step by step.
First, build the model. The model I use is CNN-multichannel, which contains two sets of word embeddings. Both of them are copies of the word embeddings generated from the corpus, but only one set is updated during training.
The code is below:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class textCNNMulti(nn.Module):
    def __init__(self, args):
        super().__init__()
        dim = args['dim']
        n_class = args['n_class']
        embedding_matrix = args['embedding_matrix']
        kernels = [3, 4, 5]
        kernel_number = [150, 150, 150]
        # Two copies of the pretrained embeddings: one frozen, one fine-tuned.
        self.static_embed = nn.Embedding.from_pretrained(embedding_matrix)
        self.non_static_embed = nn.Embedding.from_pretrained(embedding_matrix, freeze=False)
        # One convolution per kernel size; the two embeddings act as two input channels.
        self.convs = nn.ModuleList([nn.Conv2d(2, number, (size, dim), padding=(size - 1, 0))
                                    for (size, number) in zip(kernels, kernel_number)])
        self.dropout = nn.Dropout()
        self.out = nn.Linear(sum(kernel_number), n_class)

    def forward(self, x):
        non_static_input = self.non_static_embed(x)
        static_input = self.static_embed(x)
        x = torch.stack([non_static_input, static_input], dim=1)
        x = [F.relu(conv(x)).squeeze(3) for conv in self.convs]
        # Max-over-time pooling for each feature map.
        x = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in x]
        x = torch.cat(x, 1)
        x = self.dropout(x)
        x = self.out(x)
        return x
```
Second, convert the text into word indices, so each sentence becomes a vector for training.
```python
from torchtext import data, datasets  # legacy torchtext API (Field, BucketIterator)

TEXT = data.Field(lower=True, batch_first=True)
LABEL = data.LabelField()
train, val, test = datasets.SST.splits(TEXT, LABEL, 'data/', fine_grained=True)
TEXT.build_vocab(train, vectors="glove.840B.300d")
LABEL.build_vocab(train, val, test)
train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_sizes=(128, 256, 256), shuffle=True)
```
Field defines how to process text. Here are the most common parameters:
- sequential – Whether the datatype represents sequential data. If False, no tokenization is applied. Default: True.
- use_vocab – Whether to use a Vocab object. If False, the data in this field should already be numerical. Default: True.
- preprocessing – The Pipeline that will be applied to examples using this field after tokenizing but before numericalizing. Many Datasets replace this attribute with a custom preprocessor. Default: None.
- batch_first – Whether to produce tensors with the batch dimension first. Default: False.

datasets.SST.splits loads the SST dataset and splits it into train, validation, and test Dataset objects.
build_vocab creates the Vocab object for the Field, which contains the information to convert words into word indices and vice versa. The word embeddings are also saved as Field.vocab.vectors, which contains all of the word embeddings. Torchtext can download some pretrained vectors automatically, such as glove.840B.300d and fasttext.en.300d. You can also load your own vectors in this way; xxx.vec should be in the standard word2vec format.
```python
from torchtext.vocab import Vectors

vectors = Vectors(name='xxx.vec', cache='./')
TEXT.build_vocab(train, val, test, vectors=vectors)
```
data.BucketIterator.splits returns iterators that load batches of data from the datasets, and the texts in the same batch have similar lengths.
Now we can start to train the model. First, we wrap some parameters into args, which contains settings like the number of output classes, learning rate, log interval and so on.
```python
args = {}
args['vocb_size'] = len(TEXT.vocab)
args['dim'] = 300
args['n_class'] = len(LABEL.vocab)
args['embedding_matrix'] = TEXT.vocab.vectors
args['lr'] = 0.001
args['momentum'] = 0.8
args['epochs'] = 180
args['log_interval'] = 100
args['test_interval'] = 500
args['save_dir'] = './'
```
Finally, we can train the model.
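```python
model = textCNNMulti(args)
model.cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=args['lr'], momentum=args['momentum'])
criterion = nn.CrossEntropyLoss()
steps = 0
for epoch in range(1, args['epochs'] + 1):
    # The loop variable is named `batch` so it does not shadow torchtext's `data` module.
    for i, batch in enumerate(train_iter):
        steps += 1
        x, target = batch.text, batch.label
        x = x.cuda()
        # Shift the labels down by one in place; this is only appropriate
        # when the label vocabulary reserves index 0 (e.g. for <unk>).
        target.sub_(1)
        target = target.cuda()
        output = model(x)
        loss = criterion(output, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```
You can find textcnn.ipynb on GitHub or Colab.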
- Convolutional Neural Networks for Sentence Classification
- Understanding Convolutional Neural Networks for NLP
- Torchtext Docs
- Castor
PyTorch provides a simple DQN implementation to solve the cartpole game. However, the code is incorrect: it diverges after training (this has been discussed here).
The official code’s training data is below; its high score is about 50, and it finally diverges.
There are many reasons that lead to divergence.
First, the tutorial uses the difference of two frames as input. Not only does this lose the cart’s absolute position (this information is useful, as the game will terminate if the cart moves too far from the centre), it also confuses the agent when the difference is the same but the states differ.
Second, the replay memory is small. If the memory is too small, the agent will forget the strategy it has taken in some states. I’m not sure whether a memory of 10000 is big enough, but I suggest using a higher value.
Third, the parameters. learning_rate and target_update_interval may cause fluctuation. Here is an example on Stack Overflow. I also met this problem when training the cartpole agent: the reward stopped growing after 1000 episodes.
After doing some research on the cartpole DQN code, I managed to make a model that plays Flappy Bird. Here are the changes from the original cartpole code. Most of the techniques can be found in these two papers: Playing Atari with Deep Reinforcement Learning and Rainbow: Combining Improvements in Deep Reinforcement Learning.
Here is the model architecture:
Here is a trained result:
- Dueling DQN
The vanilla DQN has an overestimation problem, as the max function accumulates noise during training. This leads to convergence at a suboptimal point. The following two architectures were proposed to solve this problem.
Double DQN was published two years after DQN. It has two value functions: one is used to choose the action with the max Q value, and the other is used to calculate the Q value of this action.
Dueling DQN is another solution. It has two estimators: one estimates the score of the current state, and the other estimates the action score.
In order to distinguish the scores of the actions, the mean action score is subtracted from the returned Q-value:
x=val+adv-adv.mean(1,keepdim=True)
In this project, I use dueling DQN.
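To make the aggregation concrete, here is the same computation for a single state in plain Python instead of tensors. The function name `dueling_q_values` and the numbers are made up for illustration.

```python
def dueling_q_values(state_value, advantages):
    """Combine a state value and per-action advantages: Q = V + A - mean(A)."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]


q_a = dueling_q_values(1.0, [0.5, -0.5])
q_b = dueling_q_values(1.0, [2.0, 0.0])
```

Because the mean advantage is subtracted, the mean of the returned Q-values always equals the state value; only the differences between actions come from the advantage stream.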
- Image processing
I grayscale and crop the image.
- Stack frames
I use the last 4 frames as the input. This should help the agent to know how the environment changes.
- Extra FC before last layer
I add an FC layer between the image features and the FC layer that calculates the Q-value.
- Frame Skipping
Frame skipping means the agent sees and selects actions on every k-th frame instead of every frame; the last action is repeated on the skipped frames. This method accelerates the training procedure. In this project, I use frame_skipping=2, as the more frames are skipped, the more likely the bird is to hit the pipe. This method did help the agent to converge faster. More details can be found in this post.
- Prioritized Experience Replay
This idea was published here. It’s a very simple idea: replay high-TD-error experiences more frequently. My implementation is not efficient, but in the cartpole game this technique helps the agent converge faster.
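A deliberately simple (and, like mine, inefficient) version of the idea might look like the sketch below. It is not the code from my repo: the class name is made up, and the exponent `alpha` and the small constant `eps` are assumptions borrowed from the usual proportional variant of prioritized replay.

```python
import random


class SimplePrioritizedReplay:
    """O(n) proportional prioritized replay; fine for small buffers only."""

    def __init__(self, capacity, alpha=0.6, eps=1e-3):
        self.capacity = capacity
        self.alpha = alpha  # how strongly priorities skew sampling (0 = uniform)
        self.eps = eps      # keeps zero-error transitions sampleable
        self.transitions = []
        self.priorities = []

    def push(self, transition, td_error):
        if len(self.transitions) >= self.capacity:
            # Drop the oldest transition once the buffer is full.
            self.transitions.pop(0)
            self.priorities.pop(0)
        self.transitions.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size, rng=random):
        # Sample with replacement, proportionally to the stored priorities.
        return rng.choices(self.transitions, weights=self.priorities, k=batch_size)


buffer = SimplePrioritizedReplay(capacity=2)
buffer.push("old", td_error=1.0)
buffer.push("low", td_error=0.0)
buffer.push("high", td_error=10.0)  # "old" is evicted here
batch = buffer.sample(1000, rng=random.Random(0))
```

A real implementation would usually use a sum tree so that sampling and priority updates are not linear in the buffer size.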
- Colab and Kaggle Kernel
My MacBook doesn’t support CUDA, so I use these two websites to train the model. Here is a comparison of them. During training, Kaggle seems more stable; Colab usually disconnected after 1h.

| | Colab | Kaggle Kernel |
|---|---|---|
| GPU | Tesla T4 (16G) | Tesla P100 (16G) |
| RAM | 13G | 13G |
| Max training time | 12h | 9h |
| Export trained model | Google Drive | - |
—
The lesson I learnt from this project is patience. It takes a long time (maybe hundreds of thousands of steps) to see whether the model works, and there are so many parameters that can affect the final performance. It took me about 3 weeks to build the final model. So if you want to build your own model, be patient and good luck. Here are two articles talking about debugging and hyperparameter tuning in DQN:
Here are some things that may help with this task.
- TensorBoard
It’s a visualization tool made by the TensorFlow team. It’s more convenient than generating graphs manually with matplotlib. Besides reward and mean_q, these variables are also useful when debugging: TD-error, loss, action_distribution and avg_priority.
- Advanced image pre-processing
In this project, I just grayscale the image. A more advanced technique such as binarization should help the agent filter out unimportant details of the game output.
In Flappy Bird RL, the author extracts the vertical distance from the lower pipe and the horizontal distance from the next pair of pipes as the state. The trained agent can achieve a score of 3000.
- Other Improvements
Rainbow introduces many other extensions to enhance DQN; some of them have been discussed in this post.
I’ve uploaded code to this repo.
—
- Update 26-04-19
Colab’s GPU has been upgraded from K80 to Tesla T4; now it is my best bet.
- Update 07-05-19
TensorBoard is now natively supported in PyTorch after version 1.1
- Update 26-07-19
If you run out of RAM in Colab, it will show an option to double the RAM.
- Update 13-08-19
Upload video, update code.
- PyTorch REINFORCEMENT LEARNING (DQN) TUTORIAL
- 强化学习 (Reinforcement Learning, a series of Chinese posts about reinforcement learning)
- Deep Reinforcement Learning for Flappy Bird
- Flappy-Bird-Double-DQN-Pytorch
- DeepRL-Tutorials
- Speeding up DQN on PyTorch: how to solve Pong in 30 minutes
- Frame Skipping and Pre-Processing for Deep Q-Networks on Atari 2600 Games
- OpenAI Baselines: DQN
- Deep-Reinforcement-Learning-Hands-On
- DQN solution results peak at ~35 reward
(n,1).
Global attention takes all source hidden states into account, while local attention only uses part of the source hidden states.
Content-based attention uses both source hidden states and target hidden states, but location-based attention only uses source hidden states.
Here are several popular attention mechanisms:
(n,n)
(x,1), and (x,x). This is similar to a neural network with one hidden layer.
When I was working on a slot filling project, I compared these mechanisms. Concat attention produced the best result.
- Attention Variants
- Attention? Attention!
- Attention Seq2Seq with PyTorch: learning to invert a sequence
Thanks to the articles I list at the end of this post, I understand how transformers work. These posts are comprehensive, but there are some points that confused me.
First, this is the graph that is referenced by almost all of the posts related to the Transformer.
The Transformer consists of these parts: Input, Encoder*N, Output Input, Decoder*N, Output. I’ll explain them step by step.
Each input word is mapped to a 512-dimensional vector. Then a Positional Encoding (PE) is generated and added to the original embeddings.
The Transformer model does not contain recurrence or convolution. In order to let the model capture the order of the input words, it adds PE to the embeddings.
PE generates a 512-dimensional vector for each position:
$$\begin{aligned}
PE_{(pos,2i)} &= \sin(pos / 10000^{2i/d_{model}}) \\
PE_{(pos,2i+1)} &= \cos(pos / 10000^{2i/d_{model}})
\end{aligned}$$
The even and odd dimensions use the sin and cos functions respectively.
For example, the second word’s PE should be:
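As a sketch, the encoding for one position can be computed directly from the two formulas above. I assume positions are counted from 0, so the second word is `pos = 1`; the function name is made up, and `d_model=4` is used only to keep the output short.

```python
import math


def positional_encoding(pos, d_model=512):
    """Return the d_model-dimensional positional encoding of position `pos`."""
    pe = [0.0] * d_model
    for even in range(0, d_model, 2):
        # `even` plays the role of 2i in the formulas above.
        angle = pos / (10000 ** (even / d_model))
        pe[even] = math.sin(angle)
        if even + 1 < d_model:
            pe[even + 1] = math.cos(angle)
    return pe


first_word = positional_encoding(0, d_model=4)
second_word = positional_encoding(1, d_model=4)
```

For `d_model=4`, the second word’s encoding is [sin(1), cos(1), sin(1/100), cos(1/100)].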
The value range of PE is [-1,1], and each position’s PE is slightly different, as the sin and cos functions have a different frequency in each dimension. Also, for any fixed offset k,
For even dimension, let
The PE implementation in tensor2tensor uses sin in the first half of the dimensions and cos in the rest.
There are 6 Encoder layers in the Transformer; each layer consists of two sub-layers: Multi-Head Attention and a Feed Forward Neural Network.
Let’s begin with single-head attention. In short, it maps word embeddings to q, k and v vectors and uses them to calculate the attention.
The input words are mapped to q, k and v by multiplying by the Query, Key and Value matrices. Then, for a given query, the attention for each word in the sentence is calculated by this formula:
$$Attention(Q,K,V)=softmax(\frac{QK^T}{\sqrt{d_k}})V$$
Each of q, k and v is a 64-dimensional vector.
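For a single query, the computation can be sketched in plain Python. This assumes the standard scaled dot-product attention of the Transformer paper, i.e. softmax over $q \cdot k / \sqrt{d_k}$ followed by a weighted sum of the value vectors; the function names and the 2-dimensional vectors are made up to keep the example readable.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def single_query_attention(query, keys, values):
    """Return (output vector, attention weights) for one query."""
    d_k = len(query)
    # Scaled dot product between the query and every key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors.
    dim_v = len(values[0])
    output = [sum(w * value[d] for w, value in zip(weights, values)) for d in range(dim_v)]
    return output, weights


output, weights = single_query_attention(
    [1.0, 0.0], keys=[[1.0, 0.0], [1.0, 0.0]], values=[[1.0, 2.0], [3.0, 4.0]])
_, peaked = single_query_attention(
    [1.0, 0.0], keys=[[10.0, 0.0], [0.0, 0.0]], values=[[1.0, 2.0], [3.0, 4.0]])
```

Two identical keys share the attention equally, while a key that matches the query much better takes almost all of the weight.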
Matrix view:
The single-head attention only outputs a 64-dimensional vector, but the input dimension is 512. How do we transform it back to 512? That’s why the Transformer has multi-head attention.
Each head has its own (512, 64)) the concat the outputted vectors as (512, 512)) and the result is
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
The whole procedure looks like this:
This layer works like this line of code: norm(x+dropout(sublayer(x))) or x+dropout(sublayer(norm(x))). The sublayer is the Multi-Head Attention or the FF Network.
Layer Norm is similar to Batch Normalization, but it normalizes across all of the features of a layer rather than each feature across the batch. (Scale and shift are still applied per feature.) More details can be found in this paper.
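Here is a stripped-down sketch of the first variant, norm(x + sublayer(x)), for a single position. Dropout and the learned scale and shift parameters are left out on purpose, and the function names are made up.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize one position's feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(variance + eps) for v in x]


def add_and_norm(x, sublayer):
    # Residual connection followed by layer normalization.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


# A dummy sublayer that outputs zeros, so only the normalization is visible.
out = add_and_norm([1.0, 2.0, 3.0], lambda v: [0.0] * len(v))
```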
This layer is a Neural Network whose size is (512, 2048, 512). The exact same feed-forward network is independently applied to each position.
Same as Input.
The decoder is pretty similar to the Encoder. It also has 6 layers, but each Decoder layer has 3 sub-layers: it adds a masked multi-head attention at the beginning of the Decoder.
This layer is used to block future words during training. For example, suppose the output is <bos> hello world <eos>. First, we use <bos> as input to predict hello; hello world <eos> will be masked to 0.
In Encoder, the q k v vector is generated by q k v was generated by <init of sentence> or previous output.
The animation below illustrates how to apply the Transformer to machine translation.
Using a linear layer to predict the output.
- The Annotated Transformer
- The Illustrated Transformer
- The Transformer – Attention is all you need
- Seq2seq pay Attention to Self Attention: Part 2
- Transformer模型的PyTorch实现
- How to code The Transformer in Pytorch
- Deconstructing BERT, Part 2: Visualizing the Inner Workings of Attention
- Transformer: A Novel Neural Network Architecture for Language Understanding
- Dive into Deep Learning - 10.3 Transformer
- 10分钟带你深入理解Transformer原理及实现
Before talking about SimHash, let’s review some other methods which can also identify duplication.
This is the algorithm used by the diff command. It is also the edit distance with insertion and deletion as the only two edit operations.
This works well for short strings. However, the algorithm’s time complexity is
Transform the document into the set of words it contains, then use Jaccard similarity to calculate the similarity.
For example, if document A contains {a,b,c} and B contains {a,b,d}, then
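The same example, as a small sketch in Python (the function name `jaccard` is made up):

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


similarity = jaccard({"a", "b", "c"}, {"a", "b", "d"})
```

The two documents share {a, b} out of the four distinct words {a, b, c, d}, so the similarity is 0.5.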
BoW drops the word context information. In order to take word context into consideration, we convert sentences into phrases. For instance, roses are red and violets are blue will be converted to roses are red, are red and, red and violets …
Saving the shingling result takes k times the disk space if using k-word phrases. To solve this problem, save each phrase’s hash value instead of the string.
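A sketch of both steps is below. The function names are made up, and `zlib.crc32` is just one convenient choice of a hash that is stable between runs (Python’s built-in `hash` of a string is not); any other stable hash would do.

```python
import zlib


def shingles(text, k=3):
    """Return the set of k-word phrases in the text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def hashed_shingles(text, k=3):
    """Store a 32-bit hash of each phrase instead of the phrase itself."""
    return {zlib.crc32(phrase.encode("utf-8")) for phrase in shingles(text, k)}


phrases = shingles("roses are red and violets are blue")
hashes = hashed_shingles("roses are red and violets are blue")
```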
The larger the document is, the more hashes need to be compared. Is there a way to map documents to a constant-size value? MinHash tackles this problem.
It uses
Compared with Hashing, MinHash successfully reduces the time complexity and storage complexity to
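One common way to implement MinHash, sketched below under my own assumptions, is to use several hash functions and keep the minimum hash value of the document under each of them. Here the different hash functions are simulated by salting `zlib.crc32` with a seed, which is a shortcut for illustration and not a carefully chosen hash family; the function names are made up.

```python
import zlib


def minhash_signature(items, num_hashes=64):
    """Fixed-size signature: the minimum salted hash of the items, per hash function."""
    return [min(zlib.crc32(f"{seed}:{item}".encode("utf-8")) for item in items)
            for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = ["roses are red", "are red and", "red and violets"]
doc_b = ["red and violets", "are red and", "roses are red"]  # same set, other order
sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)
```

Whatever the size of the document, the signature always has `num_hashes` entries, so comparing two documents costs the same constant amount of work.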
For a given document, how do we find its most similar document? If using MinHash, we need to traverse the whole corpus. Is there a more effective method? SimHash comes to the rescue.
For a set of input hashes, SimHash generates a fingerprint (an f-bit vector) for the input. The produced fingerprints have a property: similar inputs generate similar fingerprints. So the dissimilarity of two documents can be calculated by the XOR of the two fingerprints. In Google’s Detecting Near-Duplicates for Web Crawling paper, they map 8B web pages to 64-bit fingerprints. If two fingerprints differ in at most 3 bits, then the two web pages are considered similar.
The calculation of SimHash is quite simple. Given a set of features extracted from the document and their weights, we maintain an f-dimensional vector V, initialized to 0. Each feature is hashed to f bits: if the i-th bit is 1, the feature’s weight is added to the i-th component of V; otherwise the weight is subtracted. Finally, map positive numbers to 1 and negative numbers to 0 to get the final hash value.
One easy way to do this is to use a sliding window to get sub-strings from the document. For each sub-string, use its hash value as the feature and its count as the weight.
For example, suppose we have this sentence: kk really rocks!.
First, pre-process the sentence to kkreallyrocks.
Then use a window of 4 to generate sub-strings from the sentence. We’ll get the sub-strings and their counts: (kkre, 1), (krea, 1), (real, 1) etc.
Suppose we only take these first 3 sub-strings and their hash values are 1001, 0101 and 1101 respectively. Then the final V is [1, 1, -3, 3], and the resulting hash value is 1101.
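The worked example can be checked with a few lines of Python. The function names are made up; feature hashes are passed in as integers together with their weights, and `f` is the fingerprint length in bits.

```python
def simhash(weighted_hashes, f=64):
    """Return (V, fingerprint) for a list of (feature hash, weight) pairs."""
    v = [0] * f
    for feature_hash, weight in weighted_hashes:
        for i in range(f):
            # Bit i, counted from the most significant of the f bits.
            bit = (feature_hash >> (f - 1 - i)) & 1
            v[i] += weight if bit else -weight
    fingerprint = 0
    for total in v:
        # Positive components become 1, everything else 0.
        fingerprint = (fingerprint << 1) | (1 if total > 0 else 0)
    return v, fingerprint


def hamming_distance(a, b):
    """Number of differing bits, computed from the XOR of two fingerprints."""
    return bin(a ^ b).count("1")


v, fingerprint = simhash([(0b1001, 1), (0b0101, 1), (0b1101, 1)], f=4)
```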
Iterating over all documents and comparing each with the target simhash value is a time-consuming operation. Is there a smarter way to accomplish this task? In Google’s paper, they published a very neat algorithm.
If the hash value is a 64-bit vector and we want to find the documents that differ from the target in 2 bits, then we can divide the vector into 4 parts:
Suppose part
Besides
Depending on the number of fingerprint bits and the number of documents, you need to find an optimal number of parts to split the hash value into.
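The underlying idea can be sketched as follows. This is a simplified illustration, not the exact table layout from Google’s paper: if two 64-bit fingerprints differ in at most 2 bits, then at least two of their four 16-bit parts are identical, so it is enough to look up documents that share at least one part and verify only those candidates. The class and function names are made up.

```python
from collections import defaultdict


def split_parts(fingerprint, parts=4, bits=64):
    """Split a fingerprint into (part index, part value) pairs."""
    size = bits // parts
    mask = (1 << size) - 1
    return [(i, (fingerprint >> (i * size)) & mask) for i in range(parts)]


class SimhashIndex:
    def __init__(self, parts=4, bits=64):
        self.parts = parts
        self.bits = bits
        self.table = defaultdict(set)

    def add(self, fingerprint):
        for key in split_parts(fingerprint, self.parts, self.bits):
            self.table[key].add(fingerprint)

    def query(self, fingerprint, max_distance=2):
        # Candidates share at least one exact part with the query.
        candidates = set()
        for key in split_parts(fingerprint, self.parts, self.bits):
            candidates |= self.table.get(key, set())
        # Verify the candidates with the real Hamming distance.
        return {c for c in candidates if bin(c ^ fingerprint).count("1") <= max_distance}


a = 0xFFFF0000FFFF0000
near = a ^ 0b11                 # differs from a in 2 bits
three_bits = a ^ 0b111          # shares parts with a, but differs in 3 bits
far = a ^ 0xFFFFFFFFFFFFFFFF    # every bit differs
index = SimhashIndex()
for fp in (a, near, three_bits, far):
    index.add(fp)
matches = index.query(a, max_distance=2)
```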
Recently, I’ve been working on a neo4j project. I use Py2neo to interact with the graph db. Although Py2neo is very Pythonic and easy to use, its performance is really poor. Sometimes I have to write Cypher statements manually if I can’t bear the slow execution. Here is a small script which I use to compare the performance of 4 different ways to insert nodes.
```python
import time

from graph_db import graph
from py2neo.data import Node, Subgraph


def delete_label(label):
    graph.run('MATCH (n:{}) DETACH DELETE n'.format(label))


def delete_all():
    print('delete all')
    graph.run('match (n) detach delete n')


def count_label(label):
    return len(graph.nodes.match(label))


def bench_create1():
    print('Using py2neo one by one')
    delete_label('test')
    start = time.time()
    tx = graph.begin()
    for i in range(100000):
        n = Node('test', id=i)
        tx.create(n)
    tx.commit()
    print(time.time() - start)
    print(count_label('test'))
    delete_label('test')


def bench_create2():
    print('Using cypher one by one')
    delete_label('test')
    start = time.time()
    tx = graph.begin()
    for i in range(100000):
        tx.run('create (n:test {id: $id})', id=i)
        # Commit every 1000 statements and start a new transaction.
        if i and i % 1000 == 0:
            tx.commit()
            tx = graph.begin()
    tx.commit()
    print(time.time() - start)
    print(count_label('test'))
    delete_label('test')


def bench_create3():
    print('Using Subgraph')
    delete_label('test')
    start = time.time()
    tx = graph.begin()
    nodes = []
    for i in range(100000):
        nodes.append(Node('test', id=i))
    s = Subgraph(nodes=nodes)
    tx.create(s)
    tx.commit()
    print(time.time() - start)
    print(count_label('test'))
    delete_label('test')


def bench_create4():
    print('Using unwind')
    delete_label('test')
    start = time.time()
    tx = graph.begin()
    ids = list(range(100000))
    tx.run('unwind $ids as id create (n:test {id: id})', ids=ids)
    tx.commit()
    print(time.time() - start)
    print(count_label('test'))
    delete_label('test')


def bench_create():
    create_tests = [bench_create1, bench_create2, bench_create3, bench_create4]
    print('testing create')
    for i in create_tests:
        i()


if __name__ == '__main__':
    bench_create()
```
Apparently, using Cypher with the unwind keyword is the fastest way to batch insert nodes.
testing create
Using py2neo one by one
96.09799289703369
100000
Using cypher one by one
9.493892192840576
100000
Using Subgraph
7.638832092285156
100000
Using unwind
2.511630058288574
100000
The above result is based on the HTTP protocol. A very interesting result is that the Bolt protocol decreases the time of the first method but doubles the time of the second method. That’s weird; maybe py2neo has some special optimization when doing batch inserts over Bolt? But I have no idea why inserting one by one with Cypher is 2x slower. Here is the result with the Bolt protocol.
testing create
Using py2neo one by one
51.73185706138611
100000
Using cypher one by one
22.051995992660522
100000
Using Subgraph
8.81674599647522
100000
Using unwind
2.8623900413513184
100000
Recently, I’ve been enjoying Spacemacs, so I decided to switch from Markdown to org files for writing my blog. After several attempts, I managed to get Travis to convert org files to HTML. Here are the steps.
First, you need to install the Org Mode plugin on your computer following the official guide: Nikola orgmode plugin.
Org Mode files are converted to HTML for display in Nikola; the Org Mode plugin calls Emacs to do this job. When I ran nikola build, it showed this message: Please install htmlize from https://github.com/hniksic/emacs-htmlize. I’m using Spacemacs, and the htmlize package is already downloaded if the org layer is enabled, so I just need to add the htmlize folder to load-path. Here is the code:
```lisp
(setq dir "~/.emacs.d/elpa/27.0/develop/")
(if (file-directory-p dir)
    (let ((default-directory dir))
      (normal-top-level-add-subdirs-to-load-path)))
(require 'htmlize)
```
This package is also needed on Travis, so a similar approach is required.
Travis uses Ubuntu 14.04, whose default Emacs version is 24 and whose Org Mode version is below 8.0, which does not meet the requirements. The easiest solution is to update Emacs to 25. So in the before_install section, add this code:
```yaml
- sudo add-apt-repository ppa:kelleyk/emacs -y
- sudo apt-get update
```
In the install section, add this code:
```yaml
- sudo apt-get remove emacs
- sudo apt autoremove
- sudo apt-get install emacs25
```
The default Emacs doesn’t contain the htmlize package, so add git clone https://github.com/hniksic/emacs-htmlize ~/emacs-htmlize to the before_install section.
Finally, modify conf.el for the Travis Emacs and add the cloned repo to load-path: (add-to-list 'load-path "~/emacs-htmlize/")
Voila, the org file should show up.
The full .travis.yml is below:
```yaml
language: python
cache: apt
sudo: false
addons:
  apt:
    packages:
      - language-pack-en-base
branches:
  only:
    - src
python:
  - 3.6
before_install:
  - sudo add-apt-repository ppa:kelleyk/emacs -y
  - sudo apt-get update
  - openssl aes-256-cbc -K $encrypted_a5c638e4bedc_key -iv $encrypted_a5c638e4bedc_iv
    -in travis.enc -out travis -d
  - git config --global user.name 'bebound'
  - git config --global user.email 'bebound@gmail.com'
  - git config --global push.default 'simple'
  - pip install --upgrade pip wheel
  - echo -e 'Host github.com\n StrictHostKeyChecking no' >> ~/.ssh/config
  - eval "$(ssh-agent -s)"
  - chmod 600 travis
  - ssh-add travis
  - git remote rm origin
  - git remote add origin git@github.com:bebound/bebound.github.io
  - git fetch origin master
  - git branch master FETCH_HEAD
  - git clone https://github.com/hniksic/emacs-htmlize ~/emacs-htmlize
install:
  - pip install 'Nikola[extras]'==7.8.15
  - sudo apt-get remove emacs
  - sudo apt autoremove
  - sudo apt-get install emacs25
script:
  - nikola build && nikola github_deploy -m 'Nikola auto deploy [ci skip]'
notifications:
  email:
    on_success: change
    on_failure: always
```
And here is the conf.el:
```lisp
(setq dir "~/.emacs.d/elpa/27.0/develop/")
(if (file-directory-p dir)
    (let ((default-directory dir))
      (normal-top-level-add-subdirs-to-load-path)))
(add-to-list 'load-path "~/emacs-htmlize/")
(require 'htmlize)
```
These days, I’m working on some text classification work, and I use gensim’s doc2vec function.
When using gensim, it shows this warning message:
C extension not loaded for Word2Vec, training will be slow.
I searched for this on the Internet and found that gensim has rewritten some parts of the code using Cython rather than NumPy to get better performance. A compiler is required to enable this feature.
I tried to install MinGW and add it to the path, but it didn’t work.
Finally, I tried to install Visual C++ Build Tools and it worked.
If this output is not -1, then it’s fine.
```python
from gensim.models import word2vec

print(word2vec.FAST_VERSION)
```
After searching on Google, here is the easiest solution. This should also work for other languages:
```python
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.font_manager as fm

f = "/System/Library/Fonts/PingFang.ttc"
prop = fm.FontProperties(fname=f)
plt.title("你好", fontproperties=prop)
plt.show()
```
Output:
It’s inevitable to deal with bugs in a coding career. The main parts of coding are implementing new features, fixing bugs and improving performance. For me, there are two kinds of bugs that are difficult to tackle: those that are hard to reproduce, and those that occur in code not written by you.
Recently, I met a bug which has both of the features mentioned above. I wrote a Spark program to analyse logs and cluster them. Last week I updated the code to use Facebook’s faiss library to accelerate the process of finding similar vectors. After I pushed the new code to Spark, the program crashed. I found this log on the Spark driver:
java.io.EOFException ERROR PythonRunner: Python worker exited unexpectedly (crashed).
Because the Python worker is created by the Spark JVM, I can’t get the internal state of the Python worker. By inserting logs into the code, I got the rough position of the crash. But the code looked good.
I had tested the code in my development environment. My development machine uses Spark 2.4, but the Spark platform uses Spark 3.0. I guessed there might be some compatibility problem with Spark 3.0, so I used the same Docker image as the Spark platform to run the code. The code worked as expected without crashing. That’s weird: Docker isolates the environment, so how could the same Docker image produce different output?
I searched for the error on Google; some said it’s because Spark is running out of memory. This didn’t seem correct, as this update shouldn’t increase the RAM usage. I still gave it a try, with no luck.
Alright, this update added faiss to the code, so maybe faiss led to the crash, as Python didn’t raise any other error. If the crash is caused by the C code in faiss, this makes sense. First, I wrote a script with Spark and faiss, and the program crashed. Then I wrote a script that only contains faiss, and it still crashed. So I can confirm that the crash is caused by faiss and Spark is innocent. Even stranger, when running on the Spark platform, sometimes the script crashes and sometimes not.
But why does faiss only crash on the Spark platform? I asked a colleague for the details of the failed job and learnt that the Docker exit code was 132, which means illegal instruction. I searched for illegal instruction in faiss’s GitHub issues and found this issue: Illegal instruction (core dumped).
I then compared the host servers’ CPU instruction sets. The crashed ones lack the AVX2 instruction set. AVX2 was introduced with Intel’s fourth-generation Core (Haswell). The development server uses a sixth-generation CPU, while some platform servers are too old to support this instruction set. By adding a parameter to force the script to be scheduled on the newer servers, the crash disappeared.
PS: Running faiss code index.add(xx) will not trigger the crash, but calling faiss.search(xx) does. When I trying to locate the code which cause the crash, the faiss package was imported correctly and the index is built normally. This mislead me to believe that faiss code is working.
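A quick way to compare hosts is to check the CPU flags before running faiss. The following is a minimal sketch that assumes a Linux host exposing `/proc/cpuinfo` with a `flags : ...` line; the helper names `cpu_flags` and `supports_avx2` are made up for this example and are not part of faiss or Spark.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set:
    """Collect CPU feature flags from /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def supports_avx2(cpuinfo_path: str = "/proc/cpuinfo") -> bool:
    """Return True if the host advertises avx2.

    Assumption: returns False when the file is missing (for example on a
    non-Linux host), so False can also mean "unknown".
    """
    path = Path(cpuinfo_path)
    if not path.exists():
        return False
    return "avx2" in cpu_flags(path.read_text())


# Hypothetical sample text in the /proc/cpuinfo format
sample = "processor : 0\nflags : fpu sse sse2 avx avx2\n"
print("avx2" in cpu_flags(sample))  # True
```

Calling `supports_avx2()` at the start of a job and logging the result would have shown immediately which servers were the old ones.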
defer is a useful statement for cleanup: deferred calls execute in LIFO order before the surrounding function returns. If you don’t know how it works, the execution result may sometimes confuse you.
I found an interesting piece of code on Stack Overflow:
type X struct {
S string
}
func (x X) Close() {
fmt.Println("Value-Closing", x.S)
}
func (x *X) CloseP() {
fmt.Println("Pointer-Closing", x.S)
}
func main() {
x := X{"Value-X First"}
defer x.Close()
x = X{"Value-X Second"}
defer x.Close()
x2 := X{"Value-X2 First"}
defer x2.CloseP()
x2 = X{"Value-X2 Second"}
defer x2.CloseP()
xp := &X{"Pointer-X First"}
defer xp.Close()
xp = &X{"Pointer-X Second"}
defer xp.Close()
xp2 := &X{"Pointer-X2 First"}
defer xp2.CloseP()
xp2 = &X{"Pointer-X2 Second"}
defer xp2.CloseP()
}
The output is:
Pointer-Closing Pointer-X2 Second
Pointer-Closing Pointer-X2 First
Value-Closing Pointer-X Second
Value-Closing Pointer-X First
Pointer-Closing Value-X2 Second
Pointer-Closing Value-X2 Second
Value-Closing Value-X Second
Value-Closing Value-X First
Look at lines 5 and 6 of the output: why was Pointer-Closing Value-X2 Second printed twice? According to Effective Go, “The arguments to the deferred function (which include the receiver if the function is a method) are evaluated when the defer executes, not when the call executes.” The evaluated arguments are saved at that moment.
As x2 is a value and the receiver of the deferred method CloseP is a pointer, Go takes the address of x2 when the defer statement executes and saves that pointer as the receiver. The following defer saves a pointer to x2 again. Although x2.S changes to “Value-X2 Second”, the address of x2 never changes. Finally, when the two deferred calls run, both pointers refer to the same x2, so the same line is printed twice.
From Golang Runtime:
runtime.Goexit() terminates the goroutine that calls it. No other goroutine is affected. Goexit runs all deferred calls before terminating the goroutine. Because Goexit is not a panic, any recover calls in those deferred functions will return nil.
Calling Goexit from the main goroutine terminates that goroutine without func main returning. Since func main has not returned, the program continues execution of other goroutines. If all other goroutines exit, the program crashes.
If you want the program to exit normally, just add defer os.Exit(0) at the top of the main function. Here is the example code:
package main
import (
"fmt"
"os"
"runtime"
"time"
)
func subGoroutine() {
defer fmt.Println("exit sub routine")
for {
fmt.Println("sub goroutine running")
time.Sleep(1 * time.Second)
}
}
func main() {
defer os.Exit(0)
defer fmt.Println("calling os.Exit")
go subGoroutine()
time.Sleep(2 * time.Second)
runtime.Goexit()
}
Output:
sub goroutine running
sub goroutine running
sub goroutine running
calling os.Exit
Process finished with exit code 0
The deferred calls in the main goroutine are executed, but those in subGoroutine are not, because os.Exit terminates the program immediately:
Exit causes the current program to exit with the given status code. Conventionally, code zero indicates success, non-zero an error. The program terminates immediately; deferred functions are not run.
from godoc
- 面向信仰编程 defer
- Golang defer clarification
- How to exit a go program honoring deferred calls?
- Effective Go
- Golang Runtime
- Go defer 遇上 os.Exit 時失效
CSRF (cross-site request forgery) is a way to forge a user’s request to a target website. For example, malicious website A has a button, and clicking it sends a request to www.B.com/logout. When the user clicks this button, they are logged out of website B without noticing. Logging out is not a big problem, but a malicious website can generate more dangerous requests, such as a money transfer.
Each web framework has a different approach to CSRF protection. In Django, the validation process is as follows:
- When the user logs in for the first time, Django generates a csrf_secret, adds a random salt and encrypts it as A, and saves A to the cookie csrftoken.
- When Django processes the tag {{ csrf_token }} or {% csrf_token %}, it reads the csrftoken cookie A, reverses it to csrf_secret, adds a random salt and encrypts it as B, and returns the corresponding HTML.
- When Django receives a POST request, it retrieves the cookie csrftoken as A and tries to get the csrfmiddlewaretoken value B from the POST data. If that does not exist, it uses the value of the X-CSRFToken header as B. Then A and B are reversed to csrf_secret. If the values are identical, the validation passes. Otherwise, a 403 error is raised.
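The salt-and-reverse idea in the steps above can be sketched in a few lines. This is a simplified illustration and not Django’s actual code: the names `mask_secret`, `unmask_token` and `tokens_match`, the alphabet and the 32-character length are assumptions chosen for the example.

```python
import secrets
import string

ALLOWED_CHARS = string.ascii_letters + string.digits  # assumed alphabet
SECRET_LENGTH = 32                                     # assumed length


def random_string(length: int = SECRET_LENGTH) -> str:
    return "".join(secrets.choice(ALLOWED_CHARS) for _ in range(length))


def mask_secret(secret: str) -> str:
    """Return salt + (secret shifted by salt); a fresh salt is drawn per call."""
    salt = random_string(len(secret))
    n = len(ALLOWED_CHARS)
    shifted = "".join(
        ALLOWED_CHARS[(ALLOWED_CHARS.index(s) + ALLOWED_CHARS.index(m)) % n]
        for s, m in zip(secret, salt)
    )
    return salt + shifted


def unmask_token(token: str) -> str:
    """Reverse mask_secret: split off the salt and undo the shift."""
    half = len(token) // 2
    salt, shifted = token[:half], token[half:]
    n = len(ALLOWED_CHARS)
    return "".join(
        ALLOWED_CHARS[(ALLOWED_CHARS.index(c) - ALLOWED_CHARS.index(m)) % n]
        for c, m in zip(shifted, salt)
    )


def tokens_match(cookie_token: str, request_token: str) -> bool:
    """Compare the two tokens by the secret they carry, not by their text."""
    return secrets.compare_digest(
        unmask_token(cookie_token), unmask_token(request_token)
    )


secret = random_string()
a = mask_secret(secret)  # plays the role of the cookie value A
b = mask_secret(secret)  # plays the role of the form value B
print(a != b, tokens_match(a, b))
```

A and B look different because each has its own salt, yet both reverse to the same secret, which is why the comparison has to happen after unmasking.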
<form>
{% csrf_token %}
</form>
$.ajax({
data: {
csrfmiddlewaretoken: '{{ csrf_token }}'
},
});
Extract csrftoken from the cookie and add it to the header of each ajax request.
function getCookie(name) {
var cookieValue = null;
if (document.cookie && document.cookie !== '') {
var cookies = document.cookie.split(';');
for (var i = 0; i < cookies.length; i++) {
var cookie = jQuery.trim(cookies[i]);
// Does this cookie string begin with the name we want?
if (cookie.substring(0, name.length + 1) === (name + '=')) {
cookieValue = decodeURIComponent(cookie.substring(name.length + 1));
break;
}
}
}
return cookieValue;
}
var csrftoken = getCookie('csrftoken');
function csrfSafeMethod(method) {
// these HTTP methods do not require CSRF protection
return (/^(GET|HEAD|OPTIONS|TRACE)$/.test(method));
}
$.ajaxSetup({
beforeSend: function(xhr, settings) {
if (!csrfSafeMethod(settings.type) && !this.crossDomain) {
xhr.setRequestHeader("X-CSRFToken", csrftoken);
}
}
});
- Cross Site Request Forgery protection
- csrf.py
- What’s the relationship between csrfmiddlewaretoken and csrftoken?
- CPython allocates memory to store the dictionary. The initial table size is 8, and entries are saved as <hash,key,value> in each slot (the slot content changed after Python 3.6).
- When a new key is added, Python uses i = hash(key) & mask, where mask = table_size - 1, to calculate which slot it should be placed in. If the slot is occupied, CPython uses a probing algorithm to find an empty slot for the new item.
- When 2/3 of the table is full, the table is resized.
- When getting an item from the dictionary, both hash and key must be equal.
When the number of elements is below 50000, the table size increases by a factor of 4 based on the used slots. Otherwise, it increases by a factor of 2. The table size is always a power of 2.
| table size | resize when number of elements reaches | new table size |
| --- | --- | --- |
| 8 | 6 | 32 |
| 32 | 22 | 128 |
| 128 | 86 | 512 |
Removing an item from a dictionary doesn’t shrink the table. The removed item’s slot is marked with a special value instead of being reset to empty, so a lookup that reaches this mark keeps probing. As a result, deleting elements from a dictionary does not decrease its memory usage. If you really want to reclaim the memory, you can use the items in the old dictionary to create a new one.
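The effect of deletion can be observed with sys.getsizeof. This is a small sketch that assumes CPython (other implementations may behave differently); the function name `dict_sizes` and the choice of keeping 10 items are just for illustration.

```python
import sys


def dict_sizes(n: int = 1000, keep: int = 10):
    """Return (size when full, size after deletions, size of a rebuilt copy)."""
    d = {i: i for i in range(n)}
    full = sys.getsizeof(d)
    for i in range(keep, n):
        del d[i]                      # the table is not shrunk here
    after_delete = sys.getsizeof(d)
    rebuilt = sys.getsizeof({k: v for k, v in d.items()})  # fresh, small table
    return full, after_delete, rebuilt


full, after_delete, rebuilt = dict_sizes()
print(full == after_delete, rebuilt < full)
```

On CPython the first comparison shows that the deletions freed no table memory, while the rebuilt dictionary is much smaller.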
CPython uses a modified pseudo-random probing algorithm to choose an empty slot. The algorithm can traverse all of the slots in a pseudo-random order.
The traversal order can be calculated by this formula: j = ((5*j) + 1) mod 2**i, where j is the slot index and 2**i is the table size.
For example, if the table size is 8 and the calculated slot index is 2, the traversal order is:
2 -> (5*2+1) mod 8 = 3 -> (5*3+1) mod 8 = 0 -> (5*0+1) mod 8 = 1 -> 6 -> 7 -> 4 -> 5 -> 2
CPython changed this formula by adding the perturb and PERTURB_SHIFT variables, where perturb is initialized to the hash value and PERTURB_SHIFT is 5. With perturb mixed in, the probe sequence depends on every bit of the hash code, which decreases the collision probability. perturb eventually becomes 0, which ensures that all of the slots will be checked.
j = (5*j) + 1 + perturb; perturb >>= PERTURB_SHIFT; j = j % 2**i
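Both formulas are easy to try out. The sketch below follows the formulas as written in this post rather than the exact CPython source; `probe_order` and `perturbed_probes` are illustrative names, and the hash is assumed to be non-negative (CPython treats it as an unsigned integer).

```python
PERTURB_SHIFT = 5


def probe_order(start: int, table_size: int) -> list:
    """Slots visited by j = (5*j + 1) mod table_size, starting at `start`."""
    order, j = [], start
    for _ in range(table_size):
        order.append(j)
        j = (5 * j + 1) % table_size
    return order


def perturbed_probes(hash_value: int, table_size: int, count: int) -> list:
    """First `count` slots visited when perturb is mixed into the recurrence."""
    mask = table_size - 1
    perturb = hash_value          # assumed non-negative
    j = hash_value & mask
    probes = []
    for _ in range(count):
        probes.append(j)
        j = (5 * j) + 1 + perturb
        perturb >>= PERTURB_SHIFT
        j %= table_size
    return probes


print(probe_order(2, 8))  # [2, 3, 0, 1, 6, 7, 4, 5]
```

The printed order matches the worked example above, and every slot of the table appears exactly once before the sequence repeats.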
CPython 3.6 uses a compact representation to store entries, and “The memory usage of the new dict() is between 20% and 25% smaller compared to Python 3.5”.
As mentioned before, entries used to be stored in the form <hash,key,value>. Each entry takes 24 bytes (three 8-byte fields) on a 64-bit machine, and no matter how many items are added to the dictionary, the memory usage is the same (24 bytes * table_size).
Since 3.6, CPython uses two structures to store the data: one is the index table, the other is the real data.
For example, if the table size is 8 and there is an item in slot 1, the index looks like this:
[null, 0, null, null, null, null, null, null]
And the real data is:
| hash | key | value |
| --- | --- | --- |
| xxx1 | yyy1 | zzz1 |
0 is the item’s index in the real data. If another item is added in slot 3, the index becomes:
[null, 0, null, 1, null, null, null, null]
The real data becomes:
| hash | key | value |
| --- | --- | --- |
| xxx1 | yyy1 | zzz1 |
| xxx2 | yyy2 | zzz2 |
This saves memory, especially when the table load factor is low.
Because new entries are appended to the real data in insertion order, the order of the entries is preserved. This behavior has been part of the language spec since Python 3.7.
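The two-structure layout can be modeled with two Python lists. This is a toy sketch and not CPython’s implementation: the class name `CompactDict` is hypothetical, and resizing, deletion and the perturb term are left out.

```python
class CompactDict:
    """Toy model: a sparse index table plus a dense, insertion-ordered entry list."""

    def __init__(self, table_size: int = 8):
        self.indices = [None] * table_size  # slot -> position in self.entries
        self.entries = []                   # [hash, key, value] rows

    def _find_slot(self, key) -> int:
        mask = len(self.indices) - 1
        j = hash(key) & mask
        # Plain (5*j + 1) probing; terminates because one slot always stays empty.
        while self.indices[j] is not None and self.entries[self.indices[j]][1] != key:
            j = (5 * j + 1) & mask
        return j

    def __setitem__(self, key, value):
        slot = self._find_slot(key)
        if self.indices[slot] is None:
            if len(self.entries) + 1 >= len(self.indices):
                raise OverflowError("toy table is full; resizing is not modeled")
            self.indices[slot] = len(self.entries)
            self.entries.append([hash(key), key, value])
        else:
            self.entries[self.indices[slot]][2] = value

    def __getitem__(self, key):
        slot = self._find_slot(key)
        if self.indices[slot] is None:
            raise KeyError(key)
        return self.entries[self.indices[slot]][2]

    def keys(self) -> list:
        return [key for _, key, _ in self.entries]


d = CompactDict()
d[1] = "zzz1"   # hash(1) & 7 == 1, so the item lands in slot 1
d[3] = "zzz2"   # hash(3) & 7 == 3, so the item lands in slot 3
print(d.indices)  # [None, 0, None, 1, None, None, None, None]
print(d.keys())   # [1, 3]
```

The index table stays sparse while the entry list has no holes, which is where the memory saving and the preserved insertion order both come from.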
- How are Python’s Built In Dictionaries Implemented
- cpython source code
- Is it possible to give a python dict an initial capacity (and is it useful)
- Python dictionary implementation
Recently, I found a really good example code for Python circular import, and I’d like to record it here.
Here is the code:
# X.py
def X1():
return "x1"
from Y import Y2
def X2():
return "x2"
# Y.py
def Y1():
return "y1"
from X import X1
def Y2():
return "y2"
Guess what will happen if you run python X.py and python Y.py?
Here is the answer, the first one outputs this:
Traceback (most recent call last):
File "X.py", line 4, in <module>
from Y import Y2
File "/Users/kk/Y.py", line 4, in <module>
from X import X1
File "/Users/kk/X.py", line 4, in <module>
from Y import Y2
ImportError: cannot import name Y2
The second one runs normally.
If this is the same as you thought, you already know how python import works. You don’t need to read this post.
When Python imports a module for the first time, it creates a new module object, sets sys.modules[module_name] = module object, and then executes the module’s code in that object to define its content. If you import the module again, Python just returns the object saved in sys.modules.
When running python X.py, X.py itself is executed as __main__. At X.py line 4, Python adds Y to sys.modules and starts executing the code in Y.py. At Y.py line 4, it pauses the import of Y, adds X to sys.modules, and executes the code in X.py. At X.py line 4 again, Python finds Y in sys.modules and tries to import Y2 from Y. But Y2 is not yet defined, so the ImportError is raised.
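To reproduce the result without creating files by hand, the two modules can be written to a scratch directory and run as separate processes. This is a minimal sketch; `run_scripts` is a helper name invented for this example.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

X_SOURCE = '''\
def X1():
    return "x1"
from Y import Y2
def X2():
    return "x2"
'''

Y_SOURCE = '''\
def Y1():
    return "y1"
from X import X1
def Y2():
    return "y2"
'''


def run_scripts() -> dict:
    """Run `python X.py` and `python Y.py`; return the exit code of each."""
    exit_codes = {}
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, "X.py").write_text(X_SOURCE)
        Path(workdir, "Y.py").write_text(Y_SOURCE)
        for script in ("X.py", "Y.py"):
            proc = subprocess.run(
                [sys.executable, script],
                cwd=workdir, capture_output=True, text=True,
            )
            exit_codes[script] = proc.returncode
    return exit_codes


codes = run_scripts()
print(codes)
```

X.py exits with a non-zero code because of the ImportError, while Y.py exits with 0: when Y.py is the entry point, X1 is already defined by the time the partially initialized X is imported again.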
- Change the import order.
- Wrap the calls into other modules in a configure function and call it manually.
- Dynamic import (use import within a function).
- Python Circular Imports
- Python Cirluar Importing
- Circular imports in Python
- Effective Python: 59 Specific Ways to Write Better Python
- Python doc: The import system
The data.Field parameters are listed here.
When calling build_vocab, torchtext adds <unk> to the vocabulary list. Set unk_token=None if you want to remove it. If sequential=True (the default), it also adds <pad> to the vocab. By default, <unk> and <pad> are added at the beginning of the vocabulary list.
LabelField is similar to Field, but it sets sequential=False, unk_token=None and is_target=True.
INPUT = data.Field(lower=True, batch_first=True)
TAG = data.LabelField()
train, val, test = data.TabularDataset.splits(path=base_dir.as_posix(), train='train_data.csv',
    validation='val_data.csv', test='test_data.csv',
    format='tsv',
    fields=[(None, None), ('input', INPUT), ('tag', TAG)])
all_data = data.TabularDataset(path=base_dir / 'gossip_train_data.csv',
    format='tsv',
    fields=[('text', TEXT), ('category', CATEGORY)])
train, val, test = all_data.split([0.7, 0.2, 0.1])
train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_sizes=(32, 256, 256), shuffle=True,
    sort_key=lambda x: x.input)
vectors = Vectors(name='cc.zh.300.vec', cache='./')
INPUT.build_vocab(train, vectors=vectors)
TAG.build_vocab(train, val, test)
You can view the vocab index with vocab.itos.
tag_size = len(TAG.vocab)
vec = INPUT.vocab.vectors
class Model:
    nn.Embedding.from_pretrained(vec, freeze=False)
s = ' '.join(segmentize(s))
s = INPUT.preprocess(s)
vec = INPUT.process([s])
Python supports multiple inheritance: a class can be derived from more than one base class. If an attribute or method is not found in the current class, how does Python decide the search order among the superclasses? In simple scenarios, we know the rule is left to right, bottom to top. But when the inheritance hierarchy becomes complicated, it’s not easy to answer by intuition.
For instance, what is the search order of class M?
class X: pass
class Y: pass
class Z: pass
class A(X, Y): pass
class B(Y, Z): pass
class M(B, A, Z): pass
The answer is: M, B, A, X, Y, Z, object
How does Python generate this sequence? Since Python 2.3, it uses the C3 linearization algorithm.
C3 follows these two equations:
L[object] = [object]
L[C(B1…BN)] = [C] + merge(L[B1]…L[BN], [B1, … ,BN])
L[C] is the MRO of class C; it evaluates to a list.
The key step is merge, which takes several lists and generates one list as follows:
- First, take the head element of the first list (L[B1]) as H.
- If H is not in the tail of any other list, output it, remove it from all of the lists, and go to step 1. Otherwise, take the head of the next list as H and repeat step 2. (The tail of a list is everything except its first element.)
- If all of the lists in merge are empty, the algorithm ends. If they are not empty but no element can be output, raise an error.
That seems complicated, so I’ll use the previous example to explain how C3 is calculated.
Let’s begin with the easy ones. First, calculate A’s MRO:
L[A(X,Y)]=[A]+merge(L[X],L[Y],[X,Y])
=[A]+merge([X,obj],[Y,obj],[X,Y])
# X is not in the tail of any other list, use it as H
=[A,X]+merge([obj],[Y,obj],[Y])
# obj is in the tail of [Y,obj], use Y as H
=[A,X,Y]+merge([obj],[obj])
=[A,X,Y,obj]
B’s MRO [B,Y,Z,obj] and Z’s MRO [Z,obj] can be calculated in the same way.
Now we can get M’s MRO:
L[M(B,A,Z)]=[M]+merge(L[B],L[A],L[Z],[B,A,Z])
=[M]+merge([B,Y,Z,obj],[A,X,Y,obj],[Z,obj],[B,A,Z])
=[M,B]+merge([Y,Z,obj],[A,X,Y,obj],[Z,obj],[A,Z])
# Y is in the tail of [A,X,Y,obj], use A as H
=[M,B,A]+merge([Y,Z,obj],[X,Y,obj],[Z,obj],[Z])
# Y is in the tail of [X,Y,obj], use X as H
=[M,B,A,X]+merge([Y,Z,obj],[Y,obj],[Z,obj],[Z])
=[M,B,A,X,Y]+merge([Z,obj],[obj],[Z,obj],[Z])
=[M,B,A,X,Y,Z]+merge([obj],[obj],[obj])
=[M,B,A,X,Y,Z,obj]
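The merge procedure is short enough to write down and compare against Python’s own result. The following is a sketch of the algorithm as described in this post, not CPython’s implementation; `merge` and `c3_mro` are names chosen for the example.

```python
def merge(sequences):
    """C3 merge: repeatedly take the first head that is in no other list's tail."""
    result = []
    sequences = [list(s) for s in sequences if s]
    while sequences:
        for seq in sequences:
            head = seq[0]
            if not any(head in other[1:] for other in sequences):
                break
        else:
            raise TypeError("inconsistent hierarchy, no valid MRO")
        result.append(head)
        # Remove the chosen head everywhere and drop lists that became empty
        sequences = [[c for c in s if c is not head] for s in sequences]
        sequences = [s for s in sequences if s]
    return result


def c3_mro(cls):
    """L[C] = [C] + merge(L[B1], ..., L[BN], [B1, ..., BN])."""
    if cls is object:
        return [object]
    bases = list(cls.__bases__)
    return [cls] + merge([c3_mro(b) for b in bases] + [bases])


class X: pass
class Y: pass
class Z: pass
class A(X, Y): pass
class B(Y, Z): pass
class M(B, A, Z): pass

print([c.__name__ for c in c3_mro(M)])
# ['M', 'B', 'A', 'X', 'Y', 'Z', 'object']
```

The result agrees with `M.__mro__`, the order that Python itself computes for the class.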
super also uses the C3 order to find the inherited method to execute.
For instance, C’s MRO is C, A, B, Base, object, so after enter A, it outputs enter B rather than enter base.
class Base:
def __init__(self):
print('enter base')
print('leave base')
class A(Base):
def __init__(self):
print('enter A')
super(A, self).__init__()
print('leave A')
class B(Base):
def __init__(self):
print('enter B')
super(B, self).__init__()
print('leave B')
class C(A, B):
def __init__(self):
print('enter C')
super(C, self).__init__()
print('leave C')
c = C()
enter C
enter A
enter B
enter base
leave base
leave B
leave A
leave C
super works like this: it gets inst’s MRO, finds the index of cls, and returns the next class in the MRO. (In Python 3, super(A, self) can be written as super().)
def super(cls, inst):
    mro = inst.__class__.mro()
    return mro[mro.index(cls) + 1]
When running the line super(C, self).__init__(), self is C’s instance, and mro is:
[<class '__main__.C'>, <class '__main__.A'>, <class '__main__.B'>, <class '__main__.Base'>, <class 'object'>]
So it returns A, and A executes __init__(), which calls super(A, self).__init__() and enters B’s __init__(). (C’s instance is passed as self along the whole call chain.)
- The Python 2.3 Method Resolution Order
- Python Multiple Inheritance
- python之理解super及MRO列表
- Python的MRO以及C3线性化算法
- C3 linearization
First, zip all of the dependencies into a zip file like this. Then you can use one of the following methods to import it.
|-- kk.zip
|   |-- kk.py
When submitting the Spark job, add the --py-files=kk.zip parameter. kk.zip will be distributed with the main script file and placed on the PYTHONPATH.
Then you can use import kk in your main script file.
This utilizes Python’s zip import feature. For more information, check this link: zipimport
You can also upload the zip file to HDFS and call sc.addPyFile('hdfs://kk.zip') after the SparkContext is initialized.
This has the same effect as --py-files, but your import statement must come after this line.
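The zip import feature that both methods rely on can be tried locally without Spark. This is a minimal sketch using only the standard library; `import_from_zip` is a helper invented for this example, and the module content is made up.

```python
import importlib
import sys
import tempfile
import zipfile
from pathlib import Path


def import_from_zip(module_name: str, source: str):
    """Pack `source` as <module_name>.py into a zip and import it from there."""
    workdir = tempfile.mkdtemp()  # left on disk; fine for a demo
    archive = Path(workdir, f"{module_name}.zip")
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr(f"{module_name}.py", source)
    # Putting the archive itself on sys.path is what makes `import kk` work
    sys.path.insert(0, str(archive))
    try:
        return importlib.import_module(module_name)
    finally:
        sys.path.remove(str(archive))


kk = import_from_zip("kk", "def hello():\n    return 'hello from kk'\n")
print(kk.hello())   # hello from kk
print(kk.__file__)  # a path inside kk.zip
```

The import only succeeds once the archive is on the search path, which is why the import statement has to come after sc.addPyFile.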
Have you ever tried to install MySQL-python? It contains C code that has to be compiled while the package is being installed. You have to follow the steps in this article: Install MySQL and MySQLClient(Python) in MacOS. Things get worse if you are using Windows.
Luckily, a new distribution format, Wheel, has been published in PEP 427.
The wheel binary package format frees installers from having to know about the build system, saves time by amortizing compile time over many installations, and removes the need to install a build system in the target environment.
Installation of wheels does not require a compiler on the system and is much faster.
Cibuildwheel is a very useful tool for building wheels. It can run on many CI servers (GitHub Actions, Travis, Azure Pipelines, etc.) and build wheels across many platforms.
You need to create a configuration file for the CI server; you can read the examples and documents.
For example, GitHub Actions can use this configuration file:
name: Build
on: [push, pull_request]
jobs:
  build_wheels:
    name: Build wheels on ${{ matrix.os }}
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-18.04, windows-latest, macos-latest]
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        name: Install Python
        with:
          python-version: '3.7'
      - name: Install cibuildwheel
        run: |
          python -m pip install cibuildwheel==1.5.5
      - name: Install Visual C++ for Python 2.7
        if: runner.os == 'Windows'
        run: |
          choco install vcpython27 -f -y
      - name: Build wheels
        run: |
          python -m cibuildwheel --output-dir wheelhouse
      - uses: actions/upload-artifact@v2
        with:
          path: ./wheelhouse/*.whl
These options can be applied by setting environment variables.
Use these options to filter the Python versions to build.
Example:
# Only build on Python 3.6
CIBW_BUILD: cp36-*
# Skip building on Python 2.7 on the Mac
CIBW_SKIP: cp27-macosx_x86_64
# Skip building on Python 3.8 on the Mac
CIBW_SKIP: cp38-macosx_x86_64
Execute the shell command before wheel building.
Now you can download wheelhouse.zip from the Actions panel on GitHub and unzip it into the dist folder. Then publish it manually with rm -rf dist && python setup.py sdist && twine upload dist/*. You can find a more detailed guide in this article: Packaging Python Projects.
This process can also be automated in the CI configuration file. You can find example configuration files in the official repo.
- Building Python Platform Wheels for Packages with Binary Extensions
- How to include external library with python wheel package
- cibuildwheel
Since Django 3.1, Django supports saving Python data to the database as JSON-encoded data, and it is also possible to query on field values inside a JSONField. The detailed usage can be found here. If you are using an older version and want to try this feature, many packages have ported it; I recommend django-jsonfield-backport.
This package saves data as JSON in the database and also supports JSON queries. If your database meets the requirements (MySQL > 5.7, PG > 9.5, MariaDB > 10.2 or SQLite > 3.9 with the JSON1 extension), you can use JSONField like Django’s native implementation.
from django.db import models
from django_jsonfield_backport.models import JSONField
class ContactInfo(models.Model):
data = JSONField()
ContactInfo.objects.create(data={
'name': 'John',
'cities': ['London', 'Cambridge'],
'pets': {'dogs': ['Rufus', 'Meg']},
})
ContactInfo.objects.filter(
data__name='John',
data__pets__has_key='dogs',
data__cities__contains='London',
).delete()
jsonfield is another popular package for using JSONField. It saves data as text in the database, but you can manipulate the field value as Python data. However, it does not provide the JSON querying capability that django-jsonfield-backport has.
As the data is stored as a JSON string in the database, the output is a string rather than an object when Django REST framework serializes a jsonfield.JSONField. If you prefer to get and update the data as an object, you need to declare the field manually as serializers.JSONField like this:
from rest_framework import serializers
from .models import Product
class ProductSerializer(serializers.ModelSerializer):
images = serializers.JSONField()
class Meta:
model = Product
fields = '__all__'
(You do not need to do this when using django-jsonfield-backport; everything just works.)
- GitHub - django-jsonfield-backport
- Use JSONField with Django Rest Framework
- JSONField in serializers – Django REST Framework
- Django REST framework - jsonfield
In Django, when you edit a field in the admin page or post data to a form, the leading and trailing whitespace in CharField and TextField values is removed.
The reason is the strip=True parameter of forms.CharField, which was added in Django 1.9. You can see the discussion in Django ticket #4960, and here is the source code. models.CharField and models.TextField use formfield() to create the form field that interacts with the user, and both of them eventually create a forms.CharField.
It only affects the values returned from forms; you can still update the model manually and call save() to store values with spaces.
Normally, this feature helps us keep text fields clean. But sometimes you may want the original value, and here are three different solutions:
Suppose we have this Test model.
# models.py
class Test(models.Model):
    char = models.CharField(max_length=20)
    text = models.TextField()
# admin.py
class TestAdmin(admin.ModelAdmin):
    def formfield_for_dbfield(self, db_field, request, **kwargs):
        if db_field.name in ['char', 'text']:
            kwargs['strip'] = False
        return super().formfield_for_dbfield(db_field, request, **kwargs)
This method tackles the problem by overriding the fields’ default formfield method.
# forms.py
class CustomForm(forms.ModelForm):
    char = forms.CharField(strip=False)
    text = forms.CharField(strip=False, widget=forms.Textarea)
    class Meta:
        model = Test
        exclude = []
# admin.py
class TestAdmin(admin.ModelAdmin):
    form = CustomForm
Now when you edit data in the admin panel, the whitespace is no longer removed.
You can also use your custom field in models.py. For example:
# models.py
from django.db import models
from django.db.models import TextField
class NonStrippingTextField(TextField):
    def formfield(self, **kwargs):
        kwargs['strip'] = False
        return super(NonStrippingTextField, self).formfield(**kwargs)
class Test(models.Model):
    text = NonStrippingTextField()
If you use Django REST framework to edit data, you only need to change the serializer.
class TestSerializer(serializers.HyperlinkedModelSerializer):
    class Meta:
        model = Test
        fields = '__all__'
        extra_kwargs = {"char": {"trim_whitespace": False},
                        "text": {"trim_whitespace": False}}
- StackOverflow - Django TextField and CharField is stripping spaces and blank lines
- Django - TextField constructor needs a strip=False option
- StackOverflow - In Django REST control serializer does not automatically remove spaces
- Allow Whitespace to be a Valid CharField Value in Django Admin
There was a long-standing memory leak in our Django app, and I fixed it recently. As time went by, the memory usage of the app kept growing, and so did the CPU usage.
After some research, I figured out the cause: some views did not close multiprocessing.Pool after using it. The problem disappeared when I used Pool in a with statement.
But I was still curious about it, so I wrote some test code. The script was run on Python 3.6.8, and it produces similar results with multiprocessing.pool.ThreadPool.
import time
from multiprocessing import Pool
def func(i):
return i
def ori():
# create many thread as time goes by, when i==300 cpu grow to 300%, run out of 16g ram and stuck I have to kill process
p = Pool(4)
r.append(p.map(func, range(4)))
def with_close():
# 100% cpu, 0.1 ram, create 40 thread, takes 41s
p = Pool(4)
r.append(p.map(func, range(4)))
p.close()
def with_terminate():
# 5% cpu, 0.1 ram, create 4 thread, takes 425s
p = Pool(4)
r.append(p.map(func, range(4)))
p.terminate()
def with_with():
# same as terminate
with Pool(4) as p:
r.append(p.map(func, range(4)))
r = []
s = time.time()
for i in range(4000):
ori()
# with_close()
# with_terminate()
# with_with()
if i % 100 == 0:
print(i)
print(f'takes {time.time() - s} seconds')
As you can see, there are four functions. The ori function uses Pool without calling close or terminate; its RAM usage keeps growing until the script gets stuck. with_close, with_terminate and with_with exit normally, but their running times differ.
Pool.terminate() calls terminate() on each worker. Pool.close() only changes the pool’s state, and each worker then terminates itself. You can find the source code on GitHub.
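For reference, here is the with-statement pattern in isolation. It is a small sketch that uses multiprocessing.pool.ThreadPool so that it runs the same way on every platform; `square` and `run_batch` are names invented for the example.

```python
from multiprocessing.pool import ThreadPool


def square(i):
    return i * i


def run_batch(values, workers: int = 4):
    # Leaving the with block calls pool.terminate(), so no pool is left behind
    with ThreadPool(workers) as pool:
        return pool.map(square, values)


# Creating a pool per call is safe here because each one is cleaned up
results = [run_batch(range(4)) for _ in range(20)]
print(results[0])  # [0, 1, 4, 9]
```

Because map blocks until all results are ready, terminating the pool on exit does not lose any work; with map_async or apply_async, the results should be collected inside the with block.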
import gc
import time
import weakref
from multiprocessing import Pool
def func(i):
return i
p = Pool(4)
wr = weakref.ref(p)
p.map(func, range(4))
print(wr())
print(gc.get_referents(wr()))
# p.close()
# p.terminate()
time.sleep(1)
del p
gc.collect()
print(wr())
print(gc.get_referents(wr()))
If close or terminate is not called, p is still alive after del p and gc.collect(), so both pairs of prints show the same pool:
<multiprocessing.pool.Pool object at 0x7fc0e6db0828>
[{'_ctx': <multiprocessing.context.ForkContext object at 0x7fc0e6d455c0>, '_inqueue': <multiprocessing.queues.SimpleQueue object at 0x7fc0e6db0860>, '_outqueue': <multiprocessing.queues.SimpleQueue object at 0x7fc0e5babac8>, '_quick_put': <bound method _ConnectionBase.send of <multiprocessing.connection.Connection object at 0x7fc0e620e8d0>>, '_quick_get': <bound method _ConnectionBase.recv of <multiprocessing.connection.Connection object at 0x7fc0e4d241d0>>, '_taskqueue': <queue.Queue object at 0x7fc0e4d24320>, '_cache': {}, '_state': 0, '_maxtasksperchild': None, '_initializer': None, '_initargs': (), '_processes': 4, '_pool': [<ForkProcess(ForkPoolWorker-1, started daemon)>, <ForkProcess(ForkPoolWorker-2, started daemon)>, <ForkProcess(ForkPoolWorker-3, started daemon)>, <ForkProcess(ForkPoolWorker-4, started daemon)>], '_worker_handler': <Thread(Thread-1, started daemon 140466410379008)>, '_task_handler': <Thread(Thread-2, started daemon 140466401986304)>, '_result_handler': <Thread(Thread-3, started daemon 140466393593600)>, '_terminate': <Finalize object, callback=_terminate_pool, args=(<queue.Queue object at 0x7fc0e4d24320>, <multiprocessing.queues.SimpleQueue object at 0x7fc0e6db0860>, <multiprocessing.queues.SimpleQueue object at 0x7fc0e5babac8>, [<ForkProcess(ForkPoolWorker-1, started daemon)>, <ForkProcess(ForkPoolWorker-2, started daemon)>, <ForkProcess(ForkPoolWorker-3, started daemon)>, <ForkProcess(ForkPoolWorker-4, started daemon)>], <Thread(Thread-1, started daemon 140466410379008)>, <Thread(Thread-2, started daemon 140466401986304)>, <Thread(Thread-3, started daemon 140466393593600)>, {}), exitprority=15>}, <class 'multiprocessing.pool.Pool'>]
<multiprocessing.pool.Pool object at 0x7fc0e6db0828>
[{'_ctx': <multiprocessing.context.ForkContext object at 0x7fc0e6d455c0>, '_inqueue': <multiprocessing.queues.SimpleQueue object at 0x7fc0e6db0860>, '_outqueue': <multiprocessing.queues.SimpleQueue object at 0x7fc0e5babac8>, '_quick_put': <bound method _ConnectionBase.send of <multiprocessing.connection.Connection object at 0x7fc0e620e8d0>>, '_quick_get': <bound method _ConnectionBase.recv of <multiprocessing.connection.Connection object at 0x7fc0e4d241d0>>, '_taskqueue': <queue.Queue object at 0x7fc0e4d24320>, '_cache': {}, '_state': 0, '_maxtasksperchild': None, '_initializer': None, '_initargs': (), '_processes': 4, '_pool': [<ForkProcess(ForkPoolWorker-1, started daemon)>, <ForkProcess(ForkPoolWorker-2, started daemon)>, <ForkProcess(ForkPoolWorker-3, started daemon)>, <ForkProcess(ForkPoolWorker-4, started daemon)>], '_worker_handler': <Thread(Thread-1, started daemon 140466410379008)>, '_task_handler': <Thread(Thread-2, started daemon 140466401986304)>, '_result_handler': <Thread(Thread-3, started daemon 140466393593600)>, '_terminate': <Finalize object, callback=_terminate_pool, args=(<queue.Queue object at 0x7fc0e4d24320>, <multiprocessing.queues.SimpleQueue object at 0x7fc0e6db0860>, <multiprocessing.queues.SimpleQueue object at 0x7fc0e5babac8>, [<ForkProcess(ForkPoolWorker-1, started daemon)>, <ForkProcess(ForkPoolWorker-2, started daemon)>, <ForkProcess(ForkPoolWorker-3, started daemon)>, <ForkProcess(ForkPoolWorker-4, started daemon)>], <Thread(Thread-1, started daemon 140466410379008)>, <Thread(Thread-2, started daemon 140466401986304)>, <Thread(Thread-3, started daemon 140466393593600)>, {}), exitprority=15>}, <class 'multiprocessing.pool.Pool'>]
After calling close() or terminate(), the last two lines become:
None
[]
The Python 3.7 documentation adds this warning:
multiprocessing.pool objects have internal resources that need to be properly managed (like any other resource) by using the pool as a context manager or by calling close() and terminate() manually. Failure to do this can lead to the process hanging on finalization. Note that is not correct to rely on the garbage collector to destroy the pool as CPython does not assure that the finalizer of the pool will be called (see object.__del__() for more information).
In Python 3.8.6, the script exits normally and the total execution time also decreases, even without calling close(). I found that this issue was fixed; see the Python bug tracker: multiprocessing.Pool and ThreadPool leak resources after being deleted.
In [[https://github.com/Azure/azure-cli/pull/26923][[Packaging] Support Python 3.11 by bebound · Pull Request #26923 · Azure/azure-cli (github.com)]] , I bumped azure-cli to use Python 3.11. We’ve bump the dependency in other PRs, I thought it should be a small PR, but in the end, a lot of changes are made.
getargspec is dropped in 3.11. You can easily replaced it with =getfullargspec= . It returns FullArgSpec(args, varargs, varkw, defaults, kwonlyargs, kwonlydefaults, annotations) instead of ArgSpec(args, varargs, keywords, defaults) So args, _, kw, _ = inspect.getargspec(fn) can be replaced by args, _, kw, *_ = inspect.getfullargspec(fn) However, getfullargspec is retained primarily for use in code that needs to maintain compatibility with the Python 2 inspect module API.
Note that signature() and Signature Object provide the recommended API for callable introspection, and support additional behaviours (like positional-only arguments) that are sometimes encountered in extension module APIs. This function is retained primarily for use in code that needs to maintain compatibility with the Python 2 inspect module API. – inspect — Inspect live objects — Python 3.11.4 documentation
The modern signature function provides a similar result but needs more changes:
import inspect
def testfunc(a, /, b=1, c=2, *args, kk, **kwargs):
    pass
print(inspect.getfullargspec(testfunc))
print(inspect.signature(testfunc).parameters)
for i, j in inspect.signature(testfunc).parameters.items():
    print(i, type(i), j, type(j), j.kind)
args, _, kw, *_ = inspect.getfullargspec(testfunc)
print(args, kw)
from inspect import Parameter
parameters = inspect.signature(testfunc).parameters
args = [k for k, v in parameters.items() if v.kind in {Parameter.POSITIONAL_OR_KEYWORD, Parameter.POSITIONAL_ONLY}]
kw = next(iter([k for k, v in parameters.items() if v.kind == Parameter.VAR_KEYWORD]), None)
print(args, kw)

FullArgSpec(args=['a', 'b', 'c'], varargs='args', varkw='kwargs', defaults=(1, 2), kwonlyargs=['kk'], kwonlydefaults=None, annotations={})
OrderedDict([('a', <Parameter "a">), ('b', <Parameter "b=1">), ('c', <Parameter "c=2">), ('args', <Parameter "*args">), ('kk', <Parameter "kk">), ('kwargs', <Parameter "**kwargs">)])
a <class 'str'> a <class 'inspect.Parameter'> POSITIONAL_ONLY
b <class 'str'> b=1 <class 'inspect.Parameter'> POSITIONAL_OR_KEYWORD
c <class 'str'> c=2 <class 'inspect.Parameter'> POSITIONAL_OR_KEYWORD
args <class 'str'> *args <class 'inspect.Parameter'> VAR_POSITIONAL
kk <class 'str'> kk <class 'inspect.Parameter'> KEYWORD_ONLY
kwargs <class 'str'> **kwargs <class 'inspect.Parameter'> VAR_KEYWORD
['a', 'b', 'c'] kwargs
['a', 'b', 'c'] kwargs
There are some custom classes in azure-cli that define members such as Foo.BAR = 'bar'. In 3.11, the [[https://docs.python.org/3/whatsnew/3.11.html#enum][Enum]] __format__() changes: it now returns the enum and member name (e.g. Color.RED). (The __str__ method is the same as in Python 3.10.)
Changed =Enum.__format__()= (the default for =format()=, =str.format()= and f-strings) to always produce the same result as
Enum.__str__(): for enums inheriting from =ReprEnum= it will be the member’s value; for all other enums it will be the enum and member name (e.g. =Color.RED=). –What’s New In Python 3.11
from enum import Enum
class Foo(str, Enum):
    BAR = "bar"
# Python 3.10
f"{Foo.BAR}" # > bar
str(Foo.BAR) # > Foo.BAR
# Python 3.11
f"{Foo.BAR}" # > Foo.BAR
str(Foo.BAR) # > Foo.BAR
The standard way to replace the Foo class is StrEnum:
from enum import StrEnum
class Foo(StrEnum):
    BAR = "bar"
# Python 3.11
f"{Foo.BAR}" # > bar
If you also use Bar(int, Enum), you can replace it with ReprEnum: Bar(int, ReprEnum).
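If the code has to run on both 3.10 and 3.11 before the classes are migrated, one version-independent option is to format the member's value explicitly. This is only a sketch under that assumption, not what the azure-cli PR did:

```python
from enum import Enum


class Foo(str, Enum):
    BAR = "bar"


# .value is the underlying string, so formatting it does not depend on
# how Enum.__format__() behaves in a given Python version.
formatted = f"{Foo.BAR.value}"
print(formatted)

# The str mixin still makes the member compare equal to the plain string.
print(Foo.BAR == "bar")
```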
In bpo-44686 (replace unittest.mock._importer with pkgutil.resolve_name by graingert · Pull Request #18544 · python/cpython (github.com)), the unittest module replaced unittest.mock._importer with pkgutil.resolve_name, which also introduces some changes.
Previously, it used __import__ to import the patch target, which does not validate the module name. pkgutil.resolve_name checks the name first, so mock.patch fails if the target is not a valid Python module name. For example, this statement fails in 3.11:
@mock.patch('azure.cli.command_modules.vm.aaz.2020_09_01_hybrid.network.vnet.List', _mock_network_client_with_existing_vnet_location)
as 2020_09_01_hybrid is not a valid identifier in Python.
_NAME_PATTERN = re.compile(f'^(?P<pkg>{dotted_words})'
f'(?P<cln>:(?P<obj>{dotted_words})?)?$',
re.UNICODE)
m = _NAME_PATTERN.match(name)
if not m:
> raise ValueError(f'invalid format: {name!r}')
E ValueError: invalid format: 'azure.cli.command_modules.vm.aaz.2020_09_01_hybrid.network.vnet'
As a workaround, mock.patch.object works.
vnet = import_module('azure.cli.command_modules.vm.aaz.2018_03_01_hybrid.network.vnet')
with mock.patch.object(vnet, 'List', _mock_network_client_with_existing_vnet):
The ultimate solution is to fix the module name.
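The workaround can be tried without azure-cli installed. The sketch below registers a fake module whose name contains a segment starting with a digit, then patches an attribute on the module object. `demo_pkg` and its `List` function are hypothetical stand-ins, not real azure-cli modules.

```python
import sys
import types
from importlib import import_module
from unittest import mock

# Register a fake package and a submodule whose name is not a valid identifier.
pkg = types.ModuleType("demo_pkg")
pkg.__path__ = []
sub = types.ModuleType("demo_pkg.2020_09_01_hybrid")
sub.List = lambda: "real"
sys.modules["demo_pkg"] = pkg
sys.modules["demo_pkg.2020_09_01_hybrid"] = sub

# import_module takes the dotted name as a plain string, and patch.object
# receives the module object, so the name is never parsed by resolve_name.
module = import_module("demo_pkg.2020_09_01_hybrid")
with mock.patch.object(module, "List", return_value="mocked"):
    patched = module.List()
restored = module.List()

print(patched, restored)
```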
In bpo-39716 (Raise on conflicting subparser names, by anntzer · Pull Request #18605 · python/cpython (github.com)), argparse started to raise an ArgumentError when the same subparser name is added twice.
import argparse
parser = argparse.ArgumentParser()
t = parser.add_subparsers()
t.add_parser('a')
t.add_parser('a')
The above code works on 3.10 but raises this error in 3.11:
Traceback (most recent call last):
File "C:\Users\kk\Developer\azure-cli\p.py", line 6, in <module>
t.add_parser('a')
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1264.0_x64__qbz5n2kfra8p0\Lib\argparse.py", line 1192, in add_parser
raise ArgumentError(self, _('conflicting subparser: %s') % name)
argparse.ArgumentError: argument {a}: conflicting subparser: a
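One way to keep such code working on both versions is to reuse an existing subparser instead of adding it again. This is only a sketch: `add_once` is a hypothetical helper, and it relies on the `choices` mapping of the object returned by add_subparsers().

```python
import argparse

parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers()


def add_once(name):
    # choices maps every registered subcommand name to its parser.
    if name in subparsers.choices:
        return subparsers.choices[name]
    return subparsers.add_parser(name)


first = add_once("a")
second = add_once("a")  # no ArgumentError: the existing parser is returned
print(first is second)
```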
- Enum with str or int Mixin Breaking Change in Python 3.11 (pecar.me)
- Enum with str or int Mixin Breaking Change in Python 3.11 · Issue #100458 · python/cpython (github.com)
- What’s New In Python 3.11 — Python 3.11.4 documentation
Python’s import statement is implemented by the __import__ function. In general, if we want to import a module dynamically, we can use the import_module function, which is a wrapper around __import__.
The most important difference between these two functions is that import_module() returns the specified package or module (e.g. pkg.mod), while __import__() returns the top-level package or module (e.g. pkg). – https://docs.python.org/3/library/importlib.html#importlib.import_module
import itertools and from requests import exceptions can be translated to:
import importlib
itertools = importlib.import_module('itertools')
exceptions = importlib.import_module('requests.exceptions')
__import__ is an advanced function that is not needed in everyday Python programming, unlike importlib.import_module().
Here is an example of how __import__ is called:
old_import = __import__
def noisy_importer(name, globals=None, locals=None, fromlist=None, level=0):
    print(f'name: {name!r}')
    print(f'fromlist: {fromlist}')
    print(f'level: {level}')
    print('-' * 80)
    # Pass globals and locals in the order __import__ expects
    return old_import(name, globals, locals, fromlist, level)
import builtins
builtins.__import__ = noisy_importer
print('import math')
import math
print('from math import sqrt')
from math import sqrt
>>>
import math
name: 'math'
fromlist: None
level: 0
--------------------------------------------------------------------------------
from math import sqrt
name: 'math'
fromlist: ('sqrt',)
level: 0
--------------------------------------------------------------------------------
As we mentioned earlier, __import__ returns the top-level module.
For example, requests = __import__('requests.exceptions', globals(), locals(), [], 0) binds the top-level requests package. If you want to get the submodule exceptions, you need to use getattr: requests_exceptions = getattr(__import__('requests', globals(), locals(), [], 0), 'exceptions').
There is another tricky way to import the submodule: use a non-empty fromlist: requests_exceptions = __import__('requests.exceptions', globals(), locals(), [None], 0).
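The difference is easy to check with a standard-library package. The sketch below uses xml.dom instead of requests so that it runs without third-party packages; the variable names are only for illustration.

```python
import importlib

# __import__ with an empty fromlist returns the top-level package.
top = __import__("xml.dom")

# import_module returns the module named by the full dotted path.
sub = importlib.import_module("xml.dom")

# A non-empty fromlist makes __import__ return the named module itself.
via_fromlist = __import__("xml.dom", globals(), locals(), ["minidom"], 0)

print(top.__name__, sub.__name__, via_fromlist.__name__)
```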
Additionally, we can also set fromlist to specify the names of submodules that should be imported. The statement from spam.ham import eggs, sausage as saus can be translated to
_temp = __import__('spam.ham', globals(), locals(), ['eggs', 'sausage'], 0)
eggs = _temp.eggs
saus = _temp.sausage
Here is a use case for the __import__ function: some packages are missing, but we want to make sure that the code does not crash when importing them.
import builtins
from unittest.mock import Mock
old_import = __import__
def skip_imports(name, globals=None, locals=None, fromlist=None, level=0):
    skip_list = {'urllib3', 'requests_oauthlib', 'cryptography'}
    if name in skip_list or any(name.startswith(f'{p}.') for p in skip_list):
        return Mock()
    else:
        return old_import(name, globals, locals, fromlist, level)
builtins.__import__ = skip_imports
- Built-in Functions — Python 3.12.1 documentation
- Python - How to use the \_\_import\_\_ function to import a name from a submodule? - Stack Overflow
Here is how sys.path is set up in Python, with some parts omitted.
By default, as initialized upon program startup, a potentially unsafe path is prepended to sys.path:
python -m: prepend the current working directory.
python script.py: prepend the script’s directory. If the script is a symbolic link, the link is resolved first.
python -c and python (REPL): prepend an empty string, which means the current working directory.
You can skip these paths with the -P option (added in Python 3.11).
If the PYTHONPATH environment variable is set, the folders in it will be added to sys.path. The folders are separated by colons on Unix and semicolons on Windows.
prefix and exec_prefix define the locations of the standard Python modules and the extension modules. Where Python searches for them depends on the OS. The starting point is the Python executable path, which is called home (symbolic links are followed).
Once home is determined, the prefix directory is found by looking for python<majorversion><minorversion>.zip, for example python312.zip. On Windows, the zip package is in the same directory as the Python executable. On Unix, it is in the lib folder. If it is not found, Python looks for Lib\os.py on Windows and lib/python3.12/os.py on Unix.
On macOS, the home is /opt/homebrew/opt/python@3.12/Frameworks/Python.framework/Versions/3.12/bin/python3.12. The prefix is /opt/homebrew/opt/python@3.12/Frameworks/Python.framework/Versions/3.12, because lib/python3.12/os.py is there.
On Windows, the exec_prefix is the same as prefix. On other systems, exec_prefix is determined by looking for lib/python3.xx/lib-dynload. On my mac, it’s still /opt/homebrew/opt/python@3.12/Frameworks/Python.framework/Versions/3.12.
lib/python312.zip, lib/python3.12 and lib/python3.12/lib-dynload are added into sys.path.
The site module is automatically imported during Python startup and tries to append the site-packages folders to sys.path. It can be disabled with the -S option.
Finding the site-packages folder is easy: it is derived from prefix and exec_prefix. These two paths are the heads, and the tail is lib/site-packages on Windows or lib/pythonX.Y/site-packages on *nix. For each head-tail combination, the path is added to sys.path if it exists.
If a name.pth file exists in the site-packages folder, its contents are additional items to be added to sys.path, one path per line.
The site module also tries to add the USER_SITE folder to sys.path. The default value is ~/.local/lib/pythonX.Y/site-packages for UNIX and non-framework macOS builds, ~/Library/Python/X.Y/lib/python/site-packages for macOS framework builds, and %APPDATA%\Python\PythonXY\site-packages on Windows.
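The values described above can also be inspected from inside the interpreter. This is a small sketch using the public helpers of the site module; it assumes a regular interpreter or a venv (very old virtualenv releases shipped a site.py without getsitepackages()).

```python
import site
import sys

print("prefix:     ", sys.prefix)
print("exec_prefix:", sys.exec_prefix)

# Global site-packages candidates derived from prefix and exec_prefix.
site_dirs = site.getsitepackages()
# Per-user site-packages folder (USER_SITE).
user_site = site.getusersitepackages()

for path in site_dirs:
    marker = "on sys.path" if path in sys.path else "not on sys.path"
    print(f"{path} ({marker})")
print("user site:", user_site, "enabled:", site.ENABLE_USER_SITE)
```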
We can use python3 -m site to quickly check sys.path and user site. Here is the output on my mac:
sys.path = [
'{current folder}',
'/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python312.zip',
'/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12',
'/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/lib-dynload',
'/opt/homebrew/lib/python3.12/site-packages',
'/opt/homebrew/opt/python@3.12/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages',
]
USER_BASE: '/Users/kk/Library/Python/3.12' (doesn't exist)
USER_SITE: '/Users/kk/Library/Python/3.12/lib/python/site-packages' (doesn't exist)
ENABLE_USER_SITE: True
The path is slightly different from what the documentation states: the site-packages folder is not under prefix. I guess that is because Homebrew creates lots of symbolic links. python3 is /opt/homebrew/bin/python3 -> opt/homebrew/Cellar/python@3.12/3.12.4/bin/python3 -> /opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/bin/python3.
>>> import sys
>>> sys.prefix
'/opt/homebrew/opt/python@3.12/Frameworks/Python.framework/Versions/3.12'
>>> sys.exec_prefix
'/opt/homebrew/opt/python@3.12/Frameworks/Python.framework/Versions/3.12'
Here is the output on my Ubuntu server:
sys.path = [
'{current folder}',
'/usr/lib/python38.zip',
'/usr/lib/python3.8',
'/usr/lib/python3.8/lib-dynload',
'/home/kk/.local/lib/python3.8/site-packages',
'/usr/local/lib/python3.8/dist-packages',
'/usr/local/lib/python3.8/dist-packages/cloud_init-20.1-py3.8.egg',
'/usr/lib/python3/dist-packages',
]
USER_BASE: '/home/kk/.local' (exists)
USER_SITE: '/home/kk/.local/lib/python3.8/site-packages' (exists)
ENABLE_USER_SITE: True
The doc also does not explain why `/opt/homebrew/lib/python3.12/site-packages` is in the path. This doc is somewhat out-of-date: https://discuss.python.org/t/the-document-on-pythonhome-might-be-wrong/19614
- sys — System-specific parameters and functions
- The initialization of the sys.path module search path
- _pth files
Nowadays, pyproject.toml has become the standard configuration file for packaging. Compared with the old setup.py, it adds two features: pep517 and pep518.
pep517 defines two hooks, build_wheel and build_sdist, which are required to build the package from source. Each build backend must implement these two hooks. This makes it possible to create other build backends such as flit or poetry.
[build-system]
# Defined by PEP 518:
requires = ["flit"]
# Defined by this PEP:
build-backend = "local_backend"
backend-path = ["backend"]
Besides setuptools, there are other build backends such as hatchling and flit. You can find examples here: Python Packaging User Guide - Choosing a build backend
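To make the two hooks concrete, here is a sketch of what an in-tree backend such as the `local_backend` referenced above could look like. The file location and the delegation to setuptools.build_meta are assumptions for illustration (with this sketch, requires would have to list setuptools); the hook names and parameters follow PEP 517.

```python
# backend/local_backend.py (hypothetical in-tree backend)


def _delegate():
    # Assumption: setuptools is listed in [build-system] requires, so it is
    # importable inside the isolated build environment.
    from setuptools import build_meta
    return build_meta


def build_wheel(wheel_directory, config_settings=None, metadata_directory=None):
    """Mandatory hook: build a wheel in wheel_directory, return its file name."""
    return _delegate().build_wheel(wheel_directory, config_settings, metadata_directory)


def build_sdist(sdist_directory, config_settings=None):
    """Mandatory hook: build an sdist in sdist_directory, return its file name."""
    return _delegate().build_sdist(sdist_directory, config_settings)
```

A build frontend such as pip or build imports this module and calls the hooks; a custom backend like this is a place to tweak behavior before handing over to the real backend.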
pep518 defines the format of pyproject.toml, where you can specify your build dependencies, ensuring the necessary tools are installed when building the project. For example:
[build-system]
requires = ["setuptools ~= 58.0", "cython ~= 0.29.0"]
According to the Python packaging documentation, setup.py is still a valid configuration file for setuptools, but invoking setup.py as a command is deprecated:
| Deprecated | Replacement |
|---|---|
| python setup.py install | pip install . |
| python setup.py develop | pip install -e . |
| python setup.py sdist | python -m build |
| python setup.py bdist_wheel | python -m build |
Build is a pep517 compatible build frontend; it calls the build backend to generate the source and wheel distributions. It’s the recommended way to build a package.
If the source distribution contains pyproject.toml, pip will use pep517 to build the package.
If the current env does not have setuptools or wheel, pip will also use pep517 to build the package: pip source code
You can also force pip to use pep517 with --use-pep517, or disable it and use the legacy behavior with --no-use-pep517.
This is a typical log which uses --use-pep517:
root@427314aff523:/# pip install requests --no-binary :all: --no-deps
Collecting requests
Using cached requests-2.32.3.tar.gz (131 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: requests
Building wheel for requests (pyproject.toml) ... done
Created wheel for requests: filename=requests-2.32.3-py3-none-any.whl size=64922 sha256=9ee1e853d3d86a8b484cf10c2920601befe81bfad4bd0c3319274b67143ac266
This is the one which uses the legacy behavior:
Processing d:\a\_work\1\s\src\azure-cli
Preparing metadata (setup.py): started
Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: azure-cli
Building wheel for azure-cli (setup.py): started
Building wheel for azure-cli (setup.py): finished with status 'done'
The main difference is that --use-pep517 creates a temporary build env and builds the package in it. The build env is totally fresh, so pip has to install the build dependencies such as setuptools first, then call the backend to build the wheel package. Finally, it installs the wheel package.
In the legacy behavior, pip uses the current env’s pip, setuptools and wheel to build the package with python setup.py bdist_wheel, then installs the wheel package.
When bumping Azure CLI to Python 3.12, get-pip.py no longer installs setuptools or wheel by default, so pip tries to use pep517 to build azure-cli.
However, the runner agent uses Python 3.12.6 and the embedded Python is 3.12.7, and they have a compatibility issue. In the build env, python -m pip fails with code 57005 because it tries to load modules from 3.12.6. I have to use python -Im pip to install the package. However, in the build env the -I option is not honored, so the command still fails. I’ve created an issue in the pip repo. The workaround is too complicated, so I install setuptools and wheel to let pip use the legacy behavior, which uses the env’s pip to build the package.
The details can be found in the PR
- pip build_env.py source code
- original issue when bump Python 3.12
- pip build process for pyproject.toml
- PEP 517 and 518 in Plain English
- Writing your pyproject.toml
Recently, there was a GitHub issue about namespace packages in Azure CLI. I think it is a good time to write down what I know about namespace packages.
If several packages share the same root folder, then the root folder is a namespace package. subpackageA and subpackageB can be installed separately, even in different Python paths, but they are imported as if they belonged to a single package: import root.
There are three ways to create a namespace package in Python; you can find the details in Packaging namespace packages.
You don’t need to create __init__.py in the root folder; use include = ["mynamespace.subpackage_a"] in pyproject.toml.
After installation, the root folder does not contain __init__.py, so Python treats it as an implicit namespace package, following PEP 420.
The only reason to use this method is to support Python 2. You need to create this __init__.py in the root folder:
__path__ = __import__('pkgutil').extend_path(__path__, __name__)
After installation, the __init__.py is also kept in the root folder.
This method relies on pkg_resources from setuptools. Since Python 3.12, setuptools is not installed by default, and pkg_resources is now also deprecated in setuptools. Currently, it shows this warning:
UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
So I don’t recommend using this method.
You need to create this __init__.py in the root folder, and declare the namespace package in setup.py: namespace_packages=['mynamespace'].
__import__('pkg_resources').declare_namespace(__name__)
After installation, the __init__.py is not kept in the root folder.
If you forget to declare the namespace package in setup.py, __init__.py is kept in the root folder, and the root folder becomes a regular package. You will get into trouble if you install subpackages in different Python paths.
The main difference is whether there is an __init__.py in the root folder after installation. If there is, it is imported as a regular package; otherwise it is a namespace package. The pkgutil style will include an __init__.py in the root folder.
Another difference is that a namespace package follows changes to sys.path, while a regular package does not. For example, if you update sys.path and the added path contains a new subpackage, the namespace package is updated to include it. A regular package is not updated, and you will get ModuleNotFoundError when you try to import the new subpackage.
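The dynamic update is easy to reproduce with a native (PEP 420) namespace package. The sketch below builds two temporary folders that each contain one portion of a hypothetical demo_ns namespace, and adds the second folder to sys.path only after the namespace has been imported.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_portion(base, namespace, subpackage):
    # Create <base>/<namespace>/<subpackage>/__init__.py, with no
    # __init__.py in the namespace folder itself.
    pkg_dir = Path(base) / namespace / subpackage
    pkg_dir.mkdir(parents=True)
    (pkg_dir / "__init__.py").write_text(f"NAME = {subpackage!r}\n")


dir_a = tempfile.mkdtemp()
dir_b = tempfile.mkdtemp()
make_portion(dir_a, "demo_ns", "sub_a")
make_portion(dir_b, "demo_ns", "sub_b")

sys.path.append(dir_a)
importlib.invalidate_caches()
import demo_ns.sub_a

# The second portion becomes visible after sys.path changes, because the
# namespace package recomputes its __path__ on the next import.
sys.path.append(dir_b)
importlib.invalidate_caches()
import demo_ns.sub_b

print(demo_ns.sub_a.NAME, demo_ns.sub_b.NAME)
print(list(demo_ns.__path__))
```

If demo_ns contained an __init__.py in dir_a, the second import would fail with ModuleNotFoundError, which is the regular-package behavior described above.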
I encountered this issue in the development of Azure CLI. azure-cli and azure-cli-core still use the pkgutil-style namespace package. After installing them in editable mode, the extensions’ azure-xxx dependencies fail to load, because there is an __init__.py in the source code and the azure folder is treated as a regular package. The extension dependencies are added to sys.path after azure is imported, so the new azure-xxx packages are ignored. (Maybe deleting azure from sys.modules and importing it again can fix this, but I haven’t tried it.)
You should avoid mixing different styles of namespace packages, but if you’re working with legacy code, you may encounter this situation. (It gets even worse when some packages are packaged incorrectly. For example, they forget to declare the namespace package in setup.py in a pkg_resources-style namespace package, so the root folder becomes a regular package.)
Let’s look at Python’s import mechanism first:
- If <directory>/foo/__init__.py is found, a regular package is imported and returned.
- If not, but <directory>/foo.{py,pyc,so,pyd} is found, a module is imported and returned. The exact list of extension varies by platform and whether the -O flag is specified. The list here is representative.
- If not, but <directory>/foo is found and is a directory, it is recorded and the scan continues with the next directory in the parent path.
- Otherwise the scan continues with the next directory in the parent path.
As you can see, if the __init__.py is found, the process stops and a regular package is imported. The pkgutil style also returns a regular package, even though it tries to collect the packages from every sys.path entry.
For example, you can use the pypa sample-namespace-packages to test namespace packages. In the pkgutil style, import example_pkg returns a regular package: <module 'example_pkg' from '/Users/kk/Developer/sample-namespace-packages/pkgutil/pkg_a/example_pkg/__init__.py'>. In the other two styles, it’s a namespace package: <module 'example_pkg' (namespace) from ['/Users/kk/Developer/sample-namespace-packages/native/pkg_a/example_pkg']>.
So it’s okay to mix these three styles, as all subpackages can still be imported. But if an invalid package (a plain regular package that follows none of these three styles) is also installed, the import might be interrupted by it.
If you use the native style or the pkg_resources style but a regular __init__.py is found on sys.path, Python returns only the regular package and ignores the namespace package, even if the namespace package has been imported. That’s why azure.cli is not found in issue #31843.
If a regular package and a pkgutil-style namespace package are both installed, the result depends on which one loads first: if the pkgutil package loads first, all subpackages are imported; if the regular package loads first, only the regular package is returned.
In conclusion, if you want to use namespace packages, use the native style and make sure every subpackage is packaged correctly. Otherwise, you may encounter module-not-found errors.
Here are some shell tools I use, which can boost your productivity. Modern-unix is a great repo that lists lots of modern unix tools.
A zsh configuration framework. It provides auto completion, prompt themes and lots of modules to work with other useful tools. I really love the agnoster theme.
Helps you navigate between folders and launch applications.
Here are the official usage examples:
v def conf => vim /some/awkward/path/to/type/default.conf
j abc => cd /hell/of/a/awkward/path/to/get/to/abcdef
m movie => mplayer /whatever/whatever/whatever/awesome_movie.mp4
o eng paper => xdg-open /you/dont/remember/where/english_paper.pdf
vim `f rc lo` => vim /etc/rc.local
vim `f rc conf` => vim /etc/rc.conf
A fast code search tool similar to ack.
A great fuzzy finder, it can also integrate with vim by fzf.vim
Magnificent app which corrects your previous console command.
More concise and user-friendly man pages. (This screenshot uses powerlevel10k theme)
Another search tool. Use rg -. to include hidden files.
A user-friendly alternative to find. Ignore hidden files and gitignore file by default.
For example: fd -H 'flac$' searches for all files ending with flac, including hidden ones.
Similar to cat with syntax highlighting and git integration.
A fast Zsh framework. You can use OMZ plugin like this:
export ZSH_CACHE_DIR=~/.cache
zmodule ohmyzsh/ohmyzsh --use degit --source 'plugins/fasd/fasd.plugin.zsh'
—
- update 18/11/19 Add tldr. The powerlevel10k theme is a fancy ZSH theme
- update 29/12/21 Add rg, bat, fd
- update 06/01/22 Add Zim
- update 01/04/24 Add maintained-modern-unix repo
Over the years, I have read so many programmers’ blogs, which has helped me a lot. Now I think it’s the time to start my own blog.
I hope this can enforce myself to review what I have learned, and it would even be better if someone can benefit from it.
I failed to preview LaTeX with emacs-plus. If you have installed d12frosted/emacs-plus, uninstall it and use emacs-mac.
brew tap railwaycat/emacsmacport
brew install emacs-mac
If you like the fancy spacemacs icon, install it with cask: brew cask install emacs-mac-spacemacs-icon
- Download and install BasicTeX.pkg here.
- Add /Library/TeX/texbin to PATH.
- Install dvisvgm by sudo tlmgr update --self && sudo tlmgr install dvisvgm collection-fontsrecommended
- Add TeX related bin to path: (setenv "PATH" (concat (getenv "PATH") ":/Library/TeX/texbin"))
- Tell Org Mode to create svg images: (setq org-latex-create-formula-image-program 'dvisvgm)
Now you can see the rendered LaTeX equation by calling org-preview-latex-fragment or using shortcut ,Tx.
If you want to load LaTeX previews automatically at startup, add this at the beginning of org file: #+STARTUP: latexpreview.
—
- update 31-07-19
_ and ... are not displayed in Emacs, as some fonts are missing. tlmgr install collection-fontsrecommended should fix this.
If the Org Preview Latex buffer outputs the warning "processing of PostScript specials is disabled (Ghostscript not found)", run brew install ghostscript.
After Inoreader changed its free plan, limiting the maximum number of subscriptions to 150, I began looking for an alternative. Finally, I found Tiny Tiny RSS. It has a nice web interface and the Fever API plugin, which is supported by most RSS reader apps, so you can read RSS on all of your devices.
This post will tell you how to deploy it on your server.
You need to install Docker and Docker Compose before using docker-compose.yml
Make a new ttrss folder, create docker-compose.yml with this content:
version: "3"
services:
  database.postgres:
    image: sameersbn/postgresql:latest
    container_name: postgres
    environment:
      - PG_PASSWORD=PWD # please change the password
      - DB_EXTENSION=pg_trgm
    volumes:
      - ~/postgres/data/:/var/lib/postgresql/ # persist postgres data to ~/postgres/data/ on the host
    ports:
      - 5433:5432
    restart: always
  service.rss:
    image: wangqiru/ttrss:latest
    container_name: ttrss
    ports:
      - 181:80
    environment:
      - SELF_URL_PATH=https://RSS.com/ # please change to your own domain
      - DB_HOST=database.postgres
      - DB_PORT=5432
      - DB_NAME=ttrss
      - DB_USER=postgres
      - DB_PASS=PWD # please change the password
      - ENABLE_PLUGINS=auth_internal,fever,api_newsplus # auth_internal is required. Plugins enabled here will be enabled for all users as system plugins
      - SESSION_COOKIE_LIFETIME=8760
    stdin_open: true
    tty: true
    restart: always
    command: sh -c 'sh /wait-for.sh database.postgres:5432 -- php /configure-db.php && exec s6-svscan /etc/s6/'
  service.mercury: # set Mercury Parser API endpoint to service.mercury:3000 on TTRSS plugin setting page
    image: wangqiru/mercury-parser-api:latest
    container_name: mercury
    expose:
      - 3000
    ports:
      - 3000:3000
    restart: always
Run this command to deploy: docker-compose up -d. After it finishes, the TTRSS service is running on port 181; the default account is admin with password password.
I made minor modifications to the yml file; you can find the latest file here.
If you have a domain, you can use Nginx as a reverse proxy to serve TTRSS on that domain.
upstream ttrssdev {
server 127.0.0.1:181;
}
server {
listen 80;
server_name RSS.com;
return 301 https://RSS.com$request_uri;
}
server {
listen 443 ssl;
gzip on;
server_name RSS.com;
access_log /var/log/nginx/ttrssdev_access.log combined;
error_log /var/log/nginx/ttrssdev_error.log;
location / {
proxy_redirect off;
proxy_pass http://ttrssdev;
proxy_set_header Host $http_host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-Ssl on;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header X-Frame-Options SAMEORIGIN;
client_max_body_size 100m;
client_body_buffer_size 128k;
proxy_buffer_size 4k;
proxy_buffers 4 32k;
proxy_busy_buffers_size 64k;
proxy_temp_file_write_size 64k;
}
ssl_certificate /etc/letsencrypt/live/rss.fromkk.com/fullchain.pem; # managed by Certbot
ssl_certificate_key /etc/letsencrypt/live/rss.fromkk.com/privkey.pem; # managed by Certbot
}
To enable HTTPS on your website, you can use certbot.
Update in 22/12/2021
I found Caddy 2 much easier to use than Nginx: all you need to do is add a few lines to `/etc/caddy/Caddyfile`
rss.com {
encode gzip zstd
reverse_proxy 127.0.0.1:181
}
Voilà, an HTTPS-enabled website is deployed.
- Fever
  - Check Enable API: Allows accessing this account through the API in Preferences
  - Enter a new password for fever in Plugins - Fever Emulation
- Mercury Fulltext Extraction
  - Check the mercury-fulltext plugin in Preference - Plugins
  - Set Mercury Parser API address to service.mercury:3000 in Feeds - Mercury Fulltext settings
Simply run these commands to update the TTRSS code.
docker-compose pull
docker-compose up -d
Reeder 4 works great on my iPad. It’s smooth and fast, and is worth every penny.
If you want a free app, I suggest Fiery Feeds. I stopped using it after ver 2.2, as it became very laggy. If this issue gets fixed, I think it will be the biggest competitor to Reeder 4. For more alternatives, read this article: The Best RSS App for iPhone and iPad.
- update 25-03-20:
You can find the latest document here.
Here is the main logic for jaeger agent and jaeger collector. (Based on jaeger 1.13.1)
Collect UDP packet from 6831 port, convert it to model.Span, send to collector by gRPC
Process gRPC or process packet from Zipkin(port 9411).
Listen gRPC and HTTP request from 16686.
These days I use InfluxDB to save some time series data. I love these features it provides:
According to its hardware guide, a single node will support more than 750k point writes per second, 100 moderate queries per second and 10M series cardinality.
Simple aggregation can be done by InfluxDB’s continuous queries.
If you submit a new point with the same measurement, tag set and timestamp, the new data will overwrite the old one.
InfluxDB is well documented, but the group by time section is not very clear. It says it will group data by preset time boundaries, but the example it uses is too simple and doesn’t explain this very well.
In the official example, when using group by time(12m), the time boundaries are 00:12, 00:24. When using group by time(30m), the time boundaries become 00:00, 00:30. It seems that the time boundaries start from the nearest hour plus multiples of the interval, but that’s not correct. If you use group by time(7m), the returned time boundaries are not 00:07, 00:14.
Here is an example:
If the data is:
{'time': '2020-01-01T00:02:00Z', 'value': 10}
{'time': '2020-01-01T00:04:00Z', 'value': 8}
{'time': '2020-01-01T00:05:00Z', 'value': 21}
{'time': '2020-01-01T00:07:00Z', 'value': 33}
{'time': '2020-01-02T00:05:00Z', 'value': 9}
{'time': '2020-01-03T10:05:00Z', 'value': 4}
Executing select sum(value) from data where time>='2020-01-01 00:00:00' and time<'2020-01-04 00:00:00' group by time(7m) fill(none) will output:
{'time': '2019-12-31T23:58:00Z', 'sum': 18}
{'time': '2020-01-01T00:05:00Z', 'sum': 54}
{'time': '2020-01-02T00:00:00Z', 'sum': 9}
{'time': '2020-01-03T10:04:00Z', 'sum': 4}
Note that the time boundary begins at 12-31 23:58, not 01-01 00:00. What causes this?
InfluxDB uses timestamp 0 (1970-01-01T00:00:00Z) as the start time, and creates a boundary at every timestamp that is divisible by the group by interval. So for this query, the boundaries are timestamp 0, timestamp 420, timestamp 840, etc. 2019-12-31 23:58:00 converts to timestamp 1577836680, which is divisible by 420, so it is the nearest time boundary for the given data.
When you use group by time(1w), you will meet the same problem: the result time begins on Thursday rather than Monday, as 1970-01-01 is a Thursday.
So when you use a group by time statement, you’d better use intervals such as 30s, 1m, 5m or 10m, which are factors of 1h, so the result always begins at xx:00.
Sometimes you want to calculate the sum of the most recent 5m of data every minute. With group by time(5m), you only get one result every 5 minutes. To achieve this, you can use the offset parameter of the group by time statement. For example, group by time(5m,1m) will move the time boundary 1 minute forward, so the results are at xx:01, xx:06. You can create 5 continuous queries with offsets from 0 to 4.
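The boundary rule described above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, not InfluxDB code; `bucket_start` and `parse` are hypothetical helper names, and the offset argument mirrors the second parameter of group by time.

```python
from datetime import datetime, timezone


def parse(text):
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def bucket_start(moment, interval_seconds, offset_seconds=0):
    # Boundaries sit at offset + k * interval, counted from the Unix epoch.
    epoch_seconds = int(moment.timestamp())
    start = (epoch_seconds - offset_seconds) // interval_seconds * interval_seconds + offset_seconds
    return datetime.fromtimestamp(start, tz=timezone.utc)


points = [
    "2020-01-01T00:02:00Z", "2020-01-01T00:04:00Z", "2020-01-01T00:05:00Z",
    "2020-01-01T00:07:00Z", "2020-01-02T00:05:00Z", "2020-01-03T10:05:00Z",
]
for text in points:
    print(text, "->", bucket_start(parse(text), 7 * 60).isoformat())

# group by time(1w): weekday() of the bucket start (Monday is 0)
print(bucket_start(parse("2020-01-01T00:00:00Z"), 7 * 24 * 3600).weekday())
# group by time(5m,1m): bucket containing 00:03
print(bucket_start(parse("2020-01-01T00:03:00Z"), 5 * 60, 60).isoformat())
```

The 7m buckets computed this way match the time column of the query result shown earlier.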
More example can be found in this repo.
According to the official resample document, resample every <interval> for <interval> can override both the execution interval of a continuous query and the time range of its query statement.
In the official examples the interval is always a multiple of group by time(m). I tried different values, and here is the result.
The every interval can be any value, regardless of the group by time interval. The CQ executes at the time boundaries of the every interval.
The for interval must be greater than or equal to the group by time(xx) interval. If it is less, influx raises an error like this: ERR: error parsing query: FOR duration must be >= GROUP BY time duration: must be a minimum of 20s, got 5s
Here is a simple example: every 10s for 45s group by time(20s)
| execute time | selected start time | selected end time | real start time | real end time |
| 16:00:30 | 15:59:45 | 16:00:30 | 16:00:00 | 16:00:40 |
| 16:00:40 | 15:59:55 | 16:00:40 | 16:00:00 | 16:00:40 |
| 16:00:50 | 16:00:05 | 16:00:50 | 16:00:20 | 16:01:00 |
| 16:01:00 | 16:00:15 | 16:01:00 | 16:00:20 | 16:01:00 |
We can see that the execution interval is always 10s, but the start time and end time used by the CQ are not equal to now()-45s and now(). They are still based on the time boundaries of group by time: the start time is the first boundary >= the selected start time, and the end time is the first boundary >= the selected end time.
Here is another example: every 5s for 10s group by time(10s)
| execute time | selected start time | selected end time | real start time | real end time |
| 16:00:00 | 15:59:50 | 16:00:00 | 15:59:50 | 16:00:00 |
| 16:00:05 | 15:59:55 | 16:00:05 | 16:00:00 | 16:00:10 |
| 16:00:10 | 16:00:00 | 16:00:10 | 16:00:00 | 16:00:10 |
| 16:00:15 | 16:00:05 | 16:00:15 | 16:00:10 | 16:00:20 |
My guess is that the start time is always >= the selected start time to avoid polluting earlier results. If a bucket were aggregated from only part of its data, the incomplete value would overwrite the correct one generated before. If the last bucket does not yet contain all of its data, it will be corrected by a later execution.
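The two tables can be reproduced with a small Python sketch. It only encodes the behaviour observed in these experiments (both ends of the selected range are rounded up to the next group by time boundary); it is not taken from the InfluxDB source, and the helper names are made up. Times are seconds since midnight, which is enough here because 10s and 20s boundaries line up with midnight.

```python
def ceil_to(t: int, step: int) -> int:
    """Round t up to the next multiple of step (t itself if already on a boundary)."""
    return -(-t // step) * step

def cq_window(execute_s: int, for_s: int, group_s: int):
    """Observed real [start, end) range of a CQ run, in seconds since midnight."""
    selected_start = execute_s - for_s
    selected_end = execute_s
    return ceil_to(selected_start, group_s), ceil_to(selected_end, group_s)

def sec(h: int, m: int, s: int) -> int:
    return h * 3600 + m * 60 + s

def hms(t: int) -> str:
    return f"{t // 3600:02d}:{t % 3600 // 60:02d}:{t % 60:02d}"

# every 10s for 45s group by time(20s)
for execute in (sec(16, 0, 30), sec(16, 0, 40), sec(16, 0, 50), sec(16, 1, 0)):
    start, end = cq_window(execute, for_s=45, group_s=20)
    print(hms(execute), hms(start), hms(end))
```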
I use Emacs to write my blog. After a recent update, I found that M-RET no longer behaves as the leader key in org mode; it behaves as org-meta-return instead. Even stranger, it still behaves as the leader key in other modes, and M-RET also works in org mode in the terminal. In the GUI, pressing C-M-m triggers the leader key.
So I opened this issue, and with the help of these friends it has been fixed. Here is the cause of the bug.
In Emacs, RET is not a key on the keyboard; it is a logical key. Emacs binds RET to C-m in its source code. In a terminal, <Enter> and C-m both send the <CR> (ASCII 13) character, so the <Enter> / <Return> key is equal to RET. In the GUI, pressing the <Enter> / <Return> key actually sends <return> to Emacs, and Emacs automatically translates <return> to RET.
This can be verified: type SPC h d k <Enter> in Spacemacs, and it outputs: RET (translated from <return>) runs the command org-open-at-point, which is an interactive compiled Lisp function in ‘org.el’.
Pressing C-m or <Enter> usually gives the same result, but you can also bind them to two different commands. Take M-RET as an example. If only <M-return> is bound, M-RET stays unbound. If only M-RET is bound, then <M-return> is implicitly bound to the same command as M-RET.
In the org mode source:
(org-defkey org-mode-map (kbd "M-<return>") #'org-meta-return)
(org-defkey org-mode-map (kbd "M-RET") #'org-meta-return)
Both keys are bound to org-meta-return.
The Spacemacs configuration before the fix binds C-M-m as dotspacemacs-major-mode-emacs-leader-key.
In the GUI, the <Enter> key sends <return> to Emacs. Org mode explicitly binds M-<return> to org-meta-return, so org-meta-return is triggered. In other modes, no M-<return> binding is defined, so <return> is translated to RET, which then triggers the leader key.
In the fixed version, dotspacemacs-major-mode-emacs-leader-key is bound to M-<return> in the GUI, which overrides org mode’s binding. Meta return finally becomes the leader key again.
- M-RET no longer org mode prefix in GUI
- Difference between the physical “RET” key and the command ‘newline in the minibuffer
- Emacs中的 return, RET, Enter, Ctrl-m解析
- How to turn off alternative Enter with Ctrl+M in Linux
It’s easy to get a small dataset from Elasticsearch by using size and from. However, it’s impractical to retrieve a large dataset in the same way.
As we know, Elasticsearch data is organised into indexes, which are logical namespaces, and the real data is stored in physical shards. Each shard is an instance of Lucene. There are two kinds of shards: primary shards and replica shards. Replica shards are copies of primary shards, kept in case nodes or shards fail. By distributing the documents of an index across multiple shards, and distributing those shards across multiple nodes, Elasticsearch can ensure redundancy and scalability. By default, Elasticsearch creates 5 primary shards and one replica shard for each primary shard (this was the default before version 7.0; later versions create 1 primary shard by default).
How does Elasticsearch decide which shard a document goes to? By default, shard = hashCode(doc._id) % primary_shards_number. To keep this mapping stable, the number of primary shards cannot be changed after the index has been created.
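A short Python sketch shows why the shard count has to stay fixed. It uses zlib.crc32 as a stand-in hash, which is an assumption made for this example (Elasticsearch uses its own hash function), and the function name shard_for is made up.

```python
import zlib

def shard_for(doc_id: str, num_primary_shards: int) -> int:
    # crc32 is only a stand-in for the real routing hash.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards

ids = [f"doc-{i}" for i in range(1000)]

# The same id always maps to the same shard while the shard count is fixed.
stable = all(shard_for(i, 5) == shard_for(i, 5) for i in ids)

# Going from 5 to 6 shards sends many ids to a different shard,
# so documents indexed earlier could no longer be found by id.
moved = sum(shard_for(i, 5) != shard_for(i, 6) for i in ids)
print(stable, moved)
```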
Usually, a shard should be between 20GB and 40GB in size. The number of shards a node can hold depends on its heap space. In general, 1GB of heap space can hold 20 shards.
Data is stored in different shards. If there are 5 shards, when running this query:
GET /_search?size=10
Each shard generates 10 search results and sends them to the coordinating node. The coordinating node sorts the 50 items and returns the first 10 to the user. However, when the query becomes this:
GET /_search?size=10&from=10000
Although we only need 10 items, each shard has to return its first 10010 results to the coordinating node, and the coordinating node has to sort 50050 items. This search costs a lot of resources.
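The cost can be simulated in Python. This is a toy model of the scatter and gather step described above, with plain integers standing in for sort values (smaller means ranked first); paged_search and its parameters are names invented for this sketch.

```python
import heapq
import random
from itertools import islice

def paged_search(shards, start, size):
    """Return (page, number of hits shipped to the coordinating node)."""
    # Every shard must return its own top (start + size) hits.
    per_shard = [sorted(shard)[: start + size] for shard in shards]
    fetched = sum(len(hits) for hits in per_shard)
    # The coordinating node merges the sorted lists and keeps one page.
    merged = heapq.merge(*per_shard)
    page = list(islice(merged, start, start + size))
    return page, fetched

rng = random.Random(0)
values = rng.sample(range(1_000_000), 100_000)
shards = [values[i::5] for i in range(5)]  # 5 shards, 20000 documents each

page, fetched = paged_search(shards, start=10_000, size=10)
print(len(page), fetched)  # 10 50050
```

Only 10 hits are returned to the user, but 50050 hits travel between the nodes.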
As deep paging is costly, Elasticsearch restricts from+size to at most index.max_result_window, whose default value is 10000.
The search method has to retrieve and sort the results over and over again, because it does not know how to continue the search from the previous position.
scroll is more efficient for retrieving a large set of data.
For example:
POST /twitter/_search?scroll=1m
{
  "size": 100,
  "query": {
    "match" : {
      "title" : "elasticsearch"
    }
  }
}
The returned result contains a _scroll_id, which should be passed to the scroll API in order to retrieve the rest of the data.
POST /_search/scroll
{
  "scroll" : "1m",
  "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}
Scroll returns the results that matched at the time of the initial search request, like a snapshot, and ignores subsequent changes to the documents (index, update or delete). scroll=1m tells Elasticsearch how long to keep the search context alive. If there are no following requests using the returned scroll_id within that time, the scroll context expires.
PS: In fact, when handling the initial search request, scroll caches the ids of all matched documents, then fetches the document content in batches of size for each following request.
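The request and response flow above can be wrapped in a small loop. The sketch below is in Python and takes the HTTP transport as a parameter: post(path, body) is a hypothetical callable that sends a JSON body to Elasticsearch and returns the parsed response, so any HTTP client can be plugged in. The paths and the _scroll_id and hits.hits fields are the ones shown in the requests above.

```python
from typing import Any, Callable, Dict, Iterator

def scroll_all(
    post: Callable[[str, Dict[str, Any]], Dict[str, Any]],
    index: str,
    query: Dict[str, Any],
    size: int = 100,
    keep_alive: str = "1m",
) -> Iterator[Dict[str, Any]]:
    """Yield every hit of a scroll search, one batch per request."""
    # Initial search: opens the scroll context and returns the first batch.
    resp = post(f"/{index}/_search?scroll={keep_alive}", {"size": size, "query": query})
    while resp["hits"]["hits"]:
        yield from resp["hits"]["hits"]
        # Follow-up requests only need the scroll id from the previous response.
        resp = post(
            "/_search/scroll",
            {"scroll": keep_alive, "scroll_id": resp["_scroll_id"]},
        )
```

A call could look like scroll_all(post, "twitter", {"match": {"title": "elasticsearch"}}), where post is whatever function your HTTP client provides. The loop stops when a batch comes back empty.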
It’s also possible to split the scroll in multiple slices and consume them independently.
GET /twitter/_search?scroll=1m
{
  "slice": {
    "id": 0,
    "max": 2
  },
  "query": {
    "match" : {
      "title" : "elasticsearch"
    }
  }
}
GET /twitter/_search?scroll=1m
{
  "slice": {
    "id": 1,
    "max": 2
  },
  "query": {
    "match" : {
      "title" : "elasticsearch"
    }
  }
}
The requests above split the scroll into 2 slices by using the max: 2 parameter. The union of the two requests’ data is equivalent to the result of a scroll query without slicing.
The slice of a document is calculated by this formula: slice(doc) = hash(doc._id) % max_slice. This is quite similar to the shard calculation mentioned before. For example, if the number of slices is 4 and the number of shards is 2, then slices 0 and 2 are assigned to the first shard and slices 1 and 3 are assigned to the second shard.
When the number of slices is n, Elasticsearch keeps a bitset per slice to remember which matched documents belong to it. So you should limit the number of sliced queries you run in parallel to avoid a memory explosion.
Computing hash(doc._id) is expensive. You can also use another numeric doc_values field to do the slicing without the hash function. For instance:
GET /twitter/_search?scroll=1m
{
  "slice": {
    "field": "date",
    "id": 0,
    "max": 10
  },
  "query": {
    "match" : {
      "title" : "elasticsearch"
    }
  }
}
Query performance is most efficient when the number of slices is equal to the number of shards in the index. If that number is large (e.g. 500), choose a lower number as too many slices will hurt performance. Setting slices higher than the number of shards generally does not improve efficiency and adds overhead.
Scroll is not suitable for real-time user requests. The Search After API was added in Elasticsearch 5. It is similar to scroll but provides a live cursor: it uses the results from the previous page to retrieve the next page.
To use search after, the query must be sorted, and each following query must also contain search_after set to the sort values of the last hit of the previous page.
For example, if the initial query is this:
GET twitter/_search
{
  "size": 10,
  "query": {
    "match" : {
      "title" : "elasticsearch"
    }
  },
  "sort": [
    {"date": "asc"},
    {"tie_breaker_id": "asc"}
  ]
}
Then you have to extract the sort value of the last document, and pass it to search_after to get the next page result.
GET twitter/_search
{
  "size": 10,
  "query": {
    "match" : {
      "title" : "elasticsearch"
    }
  },
  "search_after": [1463538857, "654323"],
  "sort": [
    {"date": "asc"},
    {"tie_breaker_id": "asc"}
  ]
}
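The idea behind search after can be illustrated without a cluster. The Python sketch below models an index as a list that is already sorted by (date, tie_breaker_id) and finds the next page with a binary search on the last sort values, instead of counting and skipping from the start. It is a simulation of the paging logic only; the function names and the sample documents are made up.

```python
import bisect

def sort_values(doc):
    return (doc["date"], doc["tie_breaker_id"])

def search(index, size, search_after=None):
    """index must already be sorted by sort_values."""
    keys = [sort_values(d) for d in index]
    # Continue right after the last hit of the previous page.
    start = 0 if search_after is None else bisect.bisect_right(keys, tuple(search_after))
    return index[start:start + size]

index = sorted(
    ({"date": 1463538857 + i // 3, "tie_breaker_id": str(654323 + i)} for i in range(25)),
    key=sort_values,
)

pages, cursor = [], None
while True:
    page = search(index, 10, cursor)
    if not page:
        break
    pages.append(page)
    # The sort values of the last hit become the next search_after.
    cursor = list(sort_values(page[-1]))

print([len(p) for p in pages])  # [10, 10, 5]
```

The tie breaker matters: several documents share the same date, and without a unique second sort value the cursor could skip or repeat some of them.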
- Elasticsearch: The Definitive Guide: Pagination
- Scalability and resilience: clusters, nodes, and shards
- ElasticSearch如何支持深度分页
- Documentation for scroll API is a bit confusing!
- Request Body Search: Scroll
- Optimizing Elasticsearch: How Many Shards per Index?
I wrote some Scala code to get the current time. However, the output differs between my development server and docker.
import java.util.Calendar
println(Calendar.getInstance().getTime)
On my development server, it outputs Sun Oct 18 18:01:01 CST 2020, but in docker it prints a UTC time.
I guessed it was related to the timezone setting and did some research. Here is the result.
All of the code can be found in this function: private static synchronized TimeZone setDefaultZone()
String zoneID = AccessController.doPrivileged(new GetPropertyAction("user.timezone"));
// if the time zone ID is not set (yet), perform the
// platform to Java time zone ID mapping.
if (zoneID == null || zoneID.isEmpty()) {
    String javaHome = AccessController.doPrivileged(
        new GetPropertyAction("java.home"));
    try {
        zoneID = getSystemTimeZoneID(javaHome);
        if (zoneID == null) {
            zoneID = GMT_ID;
        }
    } catch (NullPointerException e) {
        zoneID = GMT_ID;
    }
}
First, it checks whether the JVM has the user.timezone property. If not, it calls the native method getSystemTimeZoneID, which is implemented in java.base/share/native/libjava/TimeZone.c; the main logic is in java.base/unix/native/libjava/TimeZone_md.c.
In TimeZone_md.c, it looks for the timezone with the following steps and returns immediately once one is found.
- Find the TZ environment variable.
- Read /etc/timezone.
- Read /etc/localtime. If it is a soft link (ex: /usr/share/zoneinfo/Asia/Shanghai), return the timezone from the path. Otherwise, compare its content with all files in /usr/share/zoneinfo and, if a match is found, return that timezone.
- Return GMT as the timezone.
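The lookup order can be summarised as a pure function. The Python sketch below only mirrors the steps described in this post; it is not a port of the JVM code, it leaves out the step that compares the content of /etc/localtime with the zoneinfo files, and all parameter names are made up. Each argument is the value found at that step, or None if nothing was found.

```python
from typing import Optional

ZONEINFO_ROOT = "/usr/share/zoneinfo/"

def resolve_timezone(
    user_timezone: Optional[str],     # JVM property user.timezone
    env_tz: Optional[str],            # TZ environment variable
    etc_timezone: Optional[str],      # content of /etc/timezone
    localtime_target: Optional[str],  # symlink target of /etc/localtime
) -> str:
    if user_timezone:
        return user_timezone
    if env_tz:
        return env_tz
    if etc_timezone and etc_timezone.strip():
        return etc_timezone.strip()
    if localtime_target and localtime_target.startswith(ZONEINFO_ROOT):
        # The zone id is the path below the zoneinfo directory.
        return localtime_target[len(ZONEINFO_ROOT):]
    return "GMT"

# A container with no timezone configuration falls through to GMT.
print(resolve_timezone(None, None, None, None))
# A host whose /etc/localtime links to the Shanghai zone file.
print(resolve_timezone(None, None, None, "/usr/share/zoneinfo/Asia/Shanghai"))
```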
The available timezones on Linux can be listed with this command: timedatectl list-timezones
You can add -Duser.timezone=Asia/Shanghai to the JVM parameters.
Add export TZ=Asia/Shanghai to .bashrc.
Set the content of /etc/timezone to Asia/Shanghai.
Link /etc/localtime to /usr/share/zoneinfo/Asia/Shanghai.
All of these methods should work.
- Add this line before getting the time: TimeZone.setDefault(TimeZone.getTimeZone("Asia/Shanghai"))
- Set the JVM property in code: System.setProperty("user.timezone", "Asia/Shanghai")
- Set the timezone manually in Calendar: Calendar.getInstance(TimeZone.getTimeZone("Asia/Shanghai"))
- How to Set the JVM Time Zone
- jvm linux 时区设置
- Java default timezone detection, revisited
- Java读取系统默认时区
- How to set a JVM TimeZone Properly
After updating brew to the latest version, every cask-related command, such as brew list --cask, outputs Error: Cask 'java' is unavailable: No Cask with this name exists. The plain brew commands still work.
After doing some research, I found that Java has been moved to homebrew/core. Now it makes sense: I installed java through cask, but that cask no longer exists, so cask throws this error. If I uninstall java from cask, the error should disappear.
This is not easy, as cask itself is broken. Finally, I found this issue: brew cask upgrade fails with “No Cask with this name exists”. After running rm -rf "$(brew --prefix)/Caskroom/java", cask is back.
Kafka is a high-performance and scalable messaging system. When handling big data, the default configuration may limit its maximum performance. In this article, I’ll explain how messages are generated and saved in Kafka, and how to improve performance by changing the configuration.
In short, messages are assembled into batches (named RecordBatch) and sent to the broker.
The producer manages some internal queues, and each queue contains the RecordBatch instances that will be sent to one broker. When the send method is called, the producer looks into the internal queue and tries to append the message to a RecordBatch that is still smaller than batch.size (default value is 16KB), or creates a new RecordBatch.
There is also a sender thread in the producer, which is responsible for turning RecordBatch instances into requests (<broker node,List(ProducerBatch)>) and sending them to the broker.
The details can be found in these two articles: Apache Kafka - Message Format and A Guide To The Kafka Protocol - Apache Kafka - Apache Software Foundation.
Some important properties of RecordBatch are: batch_length, compression_type, CRC, timestamp and, of course, the List(Record).
Each Record consists of length, timestamp_delta, key(byte), value(byte) etc.
When you look into the Kafka topic data directory, you may find files like this:
00000000000000000000.log
00000000000000000000.index
00000000000000000000.timeindex
00000000000000000035.log
00000000000000000035.index
00000000000000000035.timeindex
Kafka saves each partition as segments. When a new record comes in, it is appended to the active segment. If the segment’s size limit is reached, a new segment is created and becomes the active segment. Segments are named after the offset of their first record, so the segment names are incremental.
Furthermore, each segment is divided into three kinds of files: the log file, the index file and the timeindex file.
- The log file contains the actual data.
- The index file maps a record’s relative offset to its physical position in the log file. Kafka finds the position of a specific offset with a binary search over this index instead of scanning the log file.
- The timeindex file contains the record’s timestamp and its relative offset.
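Because segment files are named after the offset of their first record, finding the segment that holds a given offset is a binary search over the base offsets. Here is a minimal Python sketch of that lookup, assuming the base offsets are already known and sorted; segment_for is a name made up for this example, not a Kafka API.

```python
import bisect

def segment_for(base_offsets, offset):
    """Name of the log file that holds the record with the given offset."""
    # Rightmost segment whose base offset is <= the requested offset.
    i = bisect.bisect_right(base_offsets, offset) - 1
    if i < 0:
        raise ValueError("offset is older than the first segment")
    # Segment file names are the base offset padded to 20 digits.
    return f"{base_offsets[i]:020d}.log"

base_offsets = [0, 35]  # the two segments from the listing above
print(segment_for(base_offsets, 34))
print(segment_for(base_offsets, 40))
```

Offset 34 is still in the first segment, while offset 40 lives in the segment that starts at offset 35.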
The consumer keeps reading data from the broker and decompresses it if necessary. It puts the data into an internal queue and returns the requested number of records to the client.
max.poll.records (default value is 500) is the maximum number of records returned in a single call to poll().
fetch.min.bytes (default value is 1) is the minimum amount of data the broker should return for a fetch request. If insufficient data is available, the server waits up to fetch.max.wait.ms milliseconds to accumulate data before answering the request.
The default socket buffer sizes in the Java client are too small for a high-throughput environment.
receive.buffer.bytes (default value is 64KB) and send.buffer.bytes (default value is 128KB) are the SO_RCVBUF and SO_SNDBUF sizes for socket connections respectively. I recommend setting them to bigger values, or to -1 to use the OS default.
As mentioned before, the producer always sends messages as RecordBatch. Each batch should be smaller than batch.size (default value is 16KB). Increasing batch.size not only reduces the number of requests sent to the broker, but also leads to a better compression ratio when compression is enabled.
linger.ms specifies how long the producer waits before sending a RecordBatch, and it indirectly affects the real size of the RecordBatch. The producer groups any records that arrive between request transmissions into a single batched request. If the system load is low and the RecordBatch is not full, the sender still sends the batch once it has waited for linger.ms. The default value of linger.ms is 0, which means the producer sends messages as quickly as possible (messages that arrive between two send requests are still batched into one RecordBatch). Increasing this value brings the real batch size closer to batch.size and reduces the number of requests, but it also increases the delay of messages.
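The interaction between batch.size and linger.ms can be seen in a toy model. The Python sketch below is not Kafka's accumulator: it assumes a single partition, ignores the time a request spends in flight (so with linger 0 every record is sent alone), and the function name simulate_batches is made up. A batch is closed when the next record would not fit or when the batch has been open for linger_ms.

```python
def simulate_batches(records, batch_size, linger_ms):
    """records: list of (arrival_ms, size_bytes) sorted by arrival time.

    Returns the batches as lists of record sizes.
    """
    batches = []
    current, current_bytes, opened_at = [], 0, None
    for arrival, size in records:
        expired = current and arrival - opened_at >= linger_ms
        full = current and current_bytes + size > batch_size
        if expired or full:
            # Close the open batch and hand it to the sender.
            batches.append(current)
            current, current_bytes = [], 0
        if not current:
            opened_at = arrival
        current.append(size)
        current_bytes += size
    if current:
        batches.append(current)
    return batches

records = [(0, 400), (1, 400), (2, 400), (3, 400)]
print(len(simulate_batches(records, batch_size=1000, linger_ms=0)))   # 4
print(len(simulate_batches(records, batch_size=1000, linger_ms=10)))  # 2
```

With a small linger the same four records need half as many requests, at the price of a few milliseconds of extra latency for the first record of each batch.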
The buffer.memory (default value is 32MB) controls the total amount of memory available to the producer for buffering. If records are sent faster than they can be transmitted to the server then this buffer space will be exhausted. When the buffer space is exhausted additional send calls will block.
As throughput keeps growing, bandwidth may become the bottleneck. It’s easy to tackle this by adding the compression.type parameter to the producer. Once it is configured, the producer compresses each RecordBatch before sending it to the broker. If the records are text, the compression ratio should be high and bandwidth usage will decrease significantly.
There are two kinds of compression.type: topic level and producer level.
If you set compression.type in the producer, the producer compresses the records and sends them to the broker.
There is also a topic-level compression.type configuration. When it is set, the producer’s compression type is not constrained: the broker converts the data sent by the producer to the target compression.type. The topic-level compression.type can be set to gzip, snappy, lz4, zstd, uncompressed, or producer. The default value is producer, which means the broker keeps the original data sent by the producer.
How to choose the compression type? According to Cloudflare’s test results in Squeezing the firehose: getting the most from Kafka compression:
| type | CPU ratio | Compression ratio |
| None | 1x | 1x |
| Gzip | 10.14x | 3.58x |
| Snappy | 1.61x | 2.35x |
| LZ4 | 2.51x | 1.81x |
Gzip has the best compression ratio but takes a lot of CPU time. Snappy keeps a balance between CPU time and space. The newer compression type zstd, added in Kafka 2.1, produces a higher compression ratio than Snappy at the cost of a little more CPU time.
These are the common configurations. You can find more, such as max.in.flight.requests.per.connection, in the official document.
- Kafka message format
- Kafka高性能探秘
- Exploit Apache Kafka’s Message Format to Save Storage and Bandwidth
- Consuming big messages from Kafka
- How does max.poll.records affect the consumer poll
- A Practical Introduction to Kafka Storage Internals
- Kafka message codec - compress and decompress
- 20 Best Practices for Working With Apache Kafka at Scale
- Kafka Document
- kafka-python KafkaProducer
- Deep Dive Into Apache Kafka | Storage Internals
I wrote a Spark program to process logs. The number of logs changes over time. To ensure logs are processed promptly, the number of executors is sized for the maximum number of logs per minute. As a consequence, CPU usage in the executors is low. To reduce this waste of resources, I tried to find a way to schedule executors while the program is running.
As shown below, the maximum number of logs per minute can be a dozen times greater than the minimum within one day.
If I can adjust the number of executors to the size of the data to process, resource usage should improve.
Spark provides a configuration to control the number of executors. When spark.dynamicAllocation.enabled is enabled, Spark automatically changes the number of running executors based on the number of tasks.
As is well known, action operators (such as count and collect) create Spark jobs. Each job is divided into stages at shuffle operations, and each data partition in a stage becomes an independent task. When dynamic allocation is enabled, if there have been pending tasks for spark.dynamicAllocation.schedulerBacklogTimeout seconds, the driver requests more executors. If pending tasks still exist, the request is triggered again every spark.dynamicAllocation.sustainedSchedulerBacklogTimeout seconds. Furthermore, the number of executors requested in each round increases exponentially from the previous round. For instance, an application adds 1 executor in the first round, and then 2, 4, 8 and so on in the subsequent rounds. The total number of running executors never exceeds spark.dynamicAllocation.maxExecutors.
When it receives the first executor request, the driver asks the cluster manager to create the executor. After the new executor is created, the driver checks whether more requests are waiting and handles all of the pending requests.
The reason for this strategy is to avoid creating too many executors when the load peaks only briefly, while still making sure enough executors are created within a reasonable time if the load stays high.
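The ramp-up is easy to picture with a few lines of Python. This sketch only models the doubling rule and the maxExecutors cap described above, assuming the backlog persists in every round; it is not Spark code and ramp_up is a made-up name.

```python
def ramp_up(rounds: int, max_executors: int, running: int = 0):
    """Total executors after each round while tasks stay pending."""
    totals = []
    to_add = 1
    for _ in range(rounds):
        # Never go beyond the configured maximum.
        added = min(to_add, max_executors - running)
        running += added
        totals.append(running)
        to_add *= 2  # 1, 2, 4, 8, ...
    return totals

print(ramp_up(rounds=5, max_executors=20))  # [1, 3, 7, 15, 20]
```

A short spike that clears after one or two rounds only costs a handful of executors, while a sustained backlog reaches the cap after a few rounds.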
After an executor has been idle for spark.dynamicAllocation.executorIdleTimeout seconds, it is released. Executors that hold cached data are not removed. To prevent executors that hold shuffle data from being removed, an additional external shuffle service is needed before Spark 3.0. From 3.0, the external shuffle service is not required if spark.dynamicAllocation.shuffleTracking.enabled is set.
Dynamic allocation is easy to use, but it has two disadvantages:
- Slow scheduling. Executors are created serially. If two or more executors are requested, the driver asks the cluster manager to create executors at least twice. This is an issue if pod creation takes time. In general that is fine, as the K8s 1.6 SLO is that 99% of pods should be created within 5s in a 5000-node cluster.
- Hard to release executors if each task is short. The release is based on idle time. If there are many short tasks, an executor is unlikely to become idle, as tasks are assigned uniformly.
In our Spark program, the tasks are short and the data must be processed within 1 minute, so dynamic allocation is not suitable.
Luckily, Spark also provides a way to control the number of executors manually. We can use sc.requestExecutors and sc.killExecutors to create and delete executors.
To use these two functions, we have to know the number of running executors and their IDs.
Number of Running Executors
The Spark program’s memory usage can be obtained from sc.getExecutorMemoryStatus. It returns a map like this: [Map(10.73.3.67:59136 -> (2101975449,2101612350))]. The key is the IP with port, and the value is a tuple containing the maximum memory and the remaining memory. Note that the driver is also included in the returned data.
IDs of Running Executors
IDs are required when calling sc.killExecutors. They can be found through the Spark REST API. Executor information such as ID, cores and tasks is recorded in /applications/[app-id]/executors.
With the help of sc.requestExecutors, we can create as many executors as we want in one request. But pod creation still takes too long. To reduce the number of pod creation requests, I used these strategies:
- The running executors are expected to finish the job in 50s, in order to reserve some time for delayed tasks.
- When the expected number of executors is close to the number currently running, no executors are requested or released.
- If there is backlog data, request more executors.
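The strategies above could be turned into a sizing rule like the following Python sketch. Everything in it is an illustrative assumption rather than the code of the actual program: the function names, the per-executor throughput parameter and the 10% tolerance are made up. Only the 50s budget comes from the list above.

```python
import math

def target_executors(pending_records: int, records_per_executor_per_s: float,
                     budget_s: int = 50) -> int:
    """Executors needed to finish the pending records within the time budget."""
    per_executor = records_per_executor_per_s * budget_s
    return max(1, math.ceil(pending_records / per_executor))

def plan(current: int, target: int, tolerance: float = 0.1) -> int:
    """Executors to add (positive) or release (negative); 0 means leave as is."""
    # Skip small adjustments so that pods are not created and deleted constantly.
    if abs(target - current) <= tolerance * current:
        return 0
    return target - current

target = target_executors(pending_records=1_000_000, records_per_executor_per_s=1000)
print(target, plan(current=20, target=21), plan(current=20, target=30))
```

A positive result would translate into one sc.requestExecutors call for that many executors, and a negative one into sc.killExecutors with the IDs taken from the REST API.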
After switching to manual allocation, CPU usage grew a lot and reached 40%. The cores used by the Spark programs dropped from 1700 to 800. Furthermore, the Spark program can scale automatically.
Today I tried to delete an inactive Internet account in System Preferences. It was deleted successfully but came back again after 20 seconds. This drove me nuts.
I tried these methods, but none of them worked.
- Boot in safe mode and delete the account.
- Delete the record in the ZACCOUNT table in ~/Library/Accounts/Accounts4.sqlite.
- Delete the related items in the Keychain Access app.
Later, RedHatDude’s answer gave me a clue: it looked like an iCloud sync problem. I deleted the account on my 3 MacBooks at the same time. Thank goodness! It has not shown up again.
I managed to fix this issue today. Here is what I did:
- Turn off keychain sync on all of my Apple devices (in the iCloud preferences). This included 2 Macs, an iPad and iPhone.
- On each device I removed the accounts I no longer wanted
- Then I turned keychain sync back on
This worked and all of my old/duplicate accounts are now gone.
- Removing Accounts from Internet Accounts
- Every time I delete an internet account it comes back
- Gmail account keeps coming back after delete
My first NAS was a Synology DS120j, an ARM-based entry-level product. It’s okay for downloading and backup, but not powerful enough to run docker and virtual machines.
So I bought this NAS last month, and I’m satisfied with it. Here are the advantages and disadvantages.
- High performance.
It is equipped with a J4125 quad-core 2.0 GHz processor, 8G RAM, two 2.5G ports and 4 bays. Here is the spec. Although the J4125 is not the fastest CPU in 2022 (the newer model comes with the N5105), it is still able to run several docker containers together, and I can even run Synology and Windows 10 inside the built-in Virtualization Station.
- Low Price.
I bought it for 2050 Yuan (about $320). Synology has a similar model, the DS920+, which is 70% more expensive.
- Qtier
A special feature only available on QNAP. It automatically moves hot and cold data between SSD and HDD to get higher performance. There is one noticeable change: this NAS is so quiet that you can hardly hear the HDD running.
- No native HEIC Support
This means you can’t preview photos imported from an iPhone. You have to pay $12 for CAYIN MediaSign Player to fix this.
- Slow Start and Shutdown
I know a NAS is rarely restarted, but taking several minutes to restart is not acceptable. I hope QNAP engineers can improve this in the future.
Here are some docker images I find useful:
johngong/qbittorrent
A torrent client which can also download new file through RSS feed.
p3terx/aria2-pro
Everyone knows what is aria2.
dreamacro/clash
A rule based proxy client.
jeessy/ddns-go
A simple DDNS client.
linuxserver/plex
Plex is an app that helps you manage and browse your media library. It can grab the metadata of TV shows, movies and music from the Internet. It provides apps on different platforms, so you can access your media from anywhere.
There is only one thing I think needs improvement: the playback speed is fixed, which is not convenient when watching animation.
You can also give Emby or Jellyfin a shot.
portainer/portainer-ce
QNAP has a built-in docker-compose command. You can use this Web app if you prefer a GUI.
vaultwarden/server
I deploy this on my VPS instead of the NAS, as port 443 is forbidden on the NAS. It’s an alternative to 1Password. Although the app UI is not perfect, it has all the functions a password manager requires. There is no official way to back up the data, so I use crontab to run a backup script that saves the data to Google Drive with Rclone.
The backup script is quite simple. You need to link your Google Drive account as google in Rclone before using it.
#!/bin/sh
pwd
echo backing up
# -f: do not fail when there is no previous archive
rm -f bit.zip
# quote the pattern so the shell does not expand it before zip sees it
zip -rq bit.zip ./data -x "./data/icon_cache/*" ./data/bitwarden.log
rclone copy bit.zip google:/应用/vaultwarden
This setting makes it possible to switch the input method based on the context of the cursor when entering insert mode.
I’m using the sis package with this configuration. You may need to install macism if you’re not using railwaycat/emacsmacport. More settings can be found in emacs-smart-input-source.
(sis-ism-lazyman-config
 "com.apple.keylayout.US"
 "com.apple.inputmethod.SCIM.ITABC")
(sis-global-cursor-color-mode t)
(sis-global-respect-mode t)
(sis-global-context-mode t)
(sis-global-inline-mode t)
You can also install fcitx-remote-for-osx and use cute-jumper/fcitx.el to do so. As homebrew no longer supports some build options, you need to follow the install instructions in the GitHub repository to build fcitx.
I use a 14pt English font and a 16pt Chinese font, so that one Chinese character is the same width as two English characters. This can be set by adding the following to the Emacs configuration file.
dotspacemacs-default-font '("Menlo"
                            :size 14.0
                            :weight normal
                            :width normal)
;; add into dotspacemacs/user-config()
(dolist (charset '(kana han symbol cjk-misc bopomofo))
  (set-fontset-font (frame-parameter nil 'font)
                    charset (font-spec :family "PingFang SC"
                                       :size 16)))
If you enable the chinese layer in Spacemacs, it provides a more convenient function:
(spacemacs//set-monospaced-font "Menlo" "PingFang SC" 14 16)
PS: valign provides visual alignment for Org Mode and Markdown without changing fonts.
It’s very common to copy a local file into the container when building a docker image. In general, we use the COPY command. But it creates a new layer and increases the final image size. If it is a temporary file and we don’t want users to waste their storage space, how can we remove it? Here are some approaches.
If the file can be downloaded from a URL, or you can create a local HTTP server to share the file, you can download the file, use it and delete it in one RUN command. For example:
RUN wget xxxx && unzip xxx && rm xxx
You can also mount the file when building the image if it can’t be downloaded from the Internet or it is secret. Use RUN --mount=type=bind to bind files or directories to the build container.
A bind mount is read-only by default; add the rw parameter to make it writable. Changes made during the build are discarded after the build is complete.
The mounted folders are kept in the image, but the files are gone. Don’t forget to delete the empty folder if you want to keep the image clean.
RUN --mount=type=bind,target=/target_dir/,source=./source_dir/,rw
RUN --mount=type=bind,target=/azure-cli.rpm,source=./docker/azure-cli.rpm tdnf install ca-certificates /azure-cli.rpm -y && tdnf clean all
You can also use --squash to reduce the image size.
Once the build is complete, Docker creates a new image that loads the diffs from each layer into a single new layer and references all of the parent’s layers. So the extra space taken by the COPY command can be freed by squash.
When working on a project with multiple developers, line endings can be troublesome. This article explains how to configure line endings in Git.
The line ending on Windows is CRLF; on Linux it is LF. To prevent line ending issues, we can set core.autocrlf to true on Windows to let Git convert CRLF to LF on commit and LF to CRLF on checkout. It is configured automatically if you install Git on Windows.
Configuring Git to handle line endings - GitHub Docs
# Configure Git to ensure line endings in files you checkout are correct for Windows.
# For compatibility, line endings are converted to Unix style when you commit files.
$ git config --global core.autocrlf true
You can also use .gitattributes to control the line endings in each repository. The .gitattributes file is a text file that tells Git how to handle the files in the repository. You can specify the line ending of each file type in this file.
With * text=auto, Git handles the files in whatever way it thinks is best. This is a good default option.
Use *.c text to explicitly declare a file as a text file, so it is always normalized and converted to native line endings on checkout.
Use *.png binary to explicitly declare a file as binary, so Git does not convert it. (binary is an alias for -text -diff)
You can use eol to force a conversion on checkout. The following config forces bat files to be converted to CRLF on checkout, even on Mac and Linux.
* text=auto
*.bat eol=crlf
This is the result of git ls-files --eol on Windows and Linux:
git ls-files --eol src/azure-cli/az.bat
i/lf w/crlf attr/text=auto eol=crlf src/azure-cli/az.bat
i means the index, w means the working tree, attr means the attribute used when checking out or committing.
You can set eof to crlf or lf. If it’s not specified, the line ending will be determined by core.autocrlf or core.eol. If text is set but neither of those variables are set, then the default value is crlf on Windows and lf on Linux and Mac.
If you change the .gitattributes file, you need to run the following command to refresh the working tree.
# Please commit the .gitattributes changes before running these commands.
git rm -rf --cached .
git reset --hard HEAD
- Line endings in a tarball also follow .gitattributes. The result is identical to a Git checkout on a Linux machine.
- The .gitattributes settings only affect new commits. If you want to change the line endings of files that are already in the Git index after changing the line ending settings, you can use git add --renormalize . to force Git to refresh all tracked files. For example, if a bat file has been added as crlf in the Git index and you then set it as text in .gitattributes, running this command asks Git to change it to lf in the index.
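The renormalize case can be tried safely in a scratch repository. The sketch below assumes a placeholder file run.bat and sets core.autocrlf to false locally so that a global setting does not change the starting state.

```shell
set -eu

demo_repo="$(mktemp -d)"
cd "$demo_repo"
git init -q .
# Keep this demo independent of any global autocrlf setting.
git config core.autocrlf false

# Step 1: the file enters the index with CRLF because no attribute applies yet.
printf '@echo off\r\necho hello\r\n' > run.bat
git add run.bat
git ls-files --eol run.bat    # index side is expected to show i/crlf

# Step 2: declare bat files as text, then re-apply normalization to tracked files.
printf '*.bat text\n' > .gitattributes
git add --renormalize .
git ls-files --eol run.bat    # index side is expected to show i/lf
```

The working tree copy is left untouched; only the staged content changes, so the result still has to be committed.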
- gitattributes - Defining attributes per path
- Configuring Git to handle line endings - GitHub Docs
- {CI} Enforce LF-only line endings in git #27137 · Azure/azure-cli (github.com)
- Git - git-add Documentation --renormalize
The disk performance in WSL2 is poor, it takes a long time to run git status in a host’s repo. Moreover, if you set a fancy shell prompt, it will take a long time to show the prompt. This article will introduce how to speed up Git in WSL2.
File system performance in WSL2 is poor when accessing the Windows drive, so running git status in a repo on the host takes a long time. The solution is to use git.exe from Windows for those folders. You can add this to your bashrc:
function git() {
  # Use the Windows Git for repos on the Windows drive; fall back to the Linux Git elsewhere.
  if pwd -P | grep -Eq "^/mnt/c(/|$)"; then
    git.exe "$@"
  else
    command git "$@"
  fi
}
If you have configured a fancy shell prompt, powerlevel10k for example, it automatically gets the git status when you enter a git repo. Inside a repo on the host, it takes a long time to show the prompt. You can speed it up in two ways.
The first one is to disable git status in the prompt. You can edit the .p10k.zsh file and comment out the vcs prompt element. The prompt will then no longer get the git status when you enter a git repo. However, you won’t see the git status even when you are in a WSL repo.
The second way is to disable the dirty state and untracked file checks. You can run these commands to disable them:
# stop checking for unstaged and staged changes
git config bash.showdirtystate false
# stop checking for untracked files
git config bash.showuntrackedfiles false
In this way, you can still see other git status such as branch name and staged files with an instant response.
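As an addendum to the wrapper function earlier in this article: it only checks the C: drive. A possible variation, sketched below, covers any single-letter drive mount. The helper name pick_git is hypothetical, and the sketch assumes WSL mounts Windows drives as /mnt/<letter>, which is the default but can be changed in wsl.conf.

```shell
# Hypothetical helper (not part of the original wrapper): choose a Git binary by path.
pick_git() {
  case "$1" in
    /mnt/[a-z] | /mnt/[a-z]/*) echo "git.exe" ;;  # Windows drive mounted by WSL
    *) echo "git" ;;                               # native Linux file system
  esac
}

# Wrapper built on the helper; "command" bypasses this function to avoid recursion.
git() {
  command "$(pick_git "$(pwd -P)")" "$@"
}
```

Keeping the path decision in its own function makes it easy to check by hand, for example pick_git /mnt/d/work should print git.exe while pick_git "$HOME" should print git.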
I bought an iPod Video 5.5th Gen 80 GB recently. It cost only 200 Yuan (about $30) and I’m satisfied with it.
The original firmware supports only a few audio formats; it can’t even play FLAC. I installed Rockbox on it, which supports FLAC and other formats and lets me transfer music without iTunes or Finder. It also supports themes and plugins, which makes it more powerful.
If you restore the iPod on macOS, the installer raises Warning: This is a MacPod, Rockbox only runs on WinPods. See http://www.rockbox.org/wiki/IpodConversionToFAT32 during installation. The easiest way to fix this is to restore it on Windows.
When I tried to install Rockbox with Rockbox Utility 1.5.1 on macOS, it raised could not open ipod permission denied when I clicked Install. Running sudo /Applications/RockboxUtility.app/Contents/MacOS/RockboxUtility fixes this issue. Some people say that using 1.4.1 on another OS also fixes it, but I haven’t tried that.
My iPod hung on Waiting for system to remount player while I was installing Rockbox. After the timeout, I disconnected the iPod and restarted it. The startup screen showed Can't load rockbox.ipod: file not found. I connected the iPod to the computer and used Rockbox Utility to install Rockbox again, this time with the bootloader unchecked, installing only Rockbox, fonts and Plugin Data. The error was gone.
There are many themes for Rockbox. I prefer fresh os light and adwaitapod Simplified. Both also come in a dark version.
See https://d00k.net/wiki/rockbox_advanced/font_combining/
The original HDD is small, slow and fragile compared with an SSD, so you can replace it with one. The options are:
- CE to m2 adaptor (chip: JMB20330) and a 2242 m.2 SATA SSD (Recommended)
- CE to TF card adaptor (adaptor is expensive but longer battery life)
- CE/ZIF SSD (discontinued)
Not every iPod OS supports a large SSD. The default OS has a maximum track limit and an SSD size limit. The track limit stems from the RAM size: the large-capacity models come with more RAM and a higher track threshold. For the iPod Classic 6th and 6.5th Gen, if you install an SSD larger than 128 GB, iTunes only recognizes 128 GB. This is due to the LBA28 limitation. Both limitations can be eliminated by Rockbox. If you want to stay on the original OS, I recommend buying a 5.5th Gen 80 GB or a 7th Gen.
| Model Description | Model No. | iTunes Storage Limit (see note below) |
| --- | --- | --- |
| 5th Gen 30Gb | MA002 / MA146 / PA002 / PA146 | ~20000 Tracks |
| 5th Gen 60Gb | MA003 / MA147 / PA003 / PA147 | ~50000 Tracks |
| 5.5th Gen 30Gb | MA444 / MA446 / PA444 / PA446 / MA664 | ~20000 Tracks |
| 5.5th Gen 80Gb | MA448 / MA450 / PA448 / PA450 | ~50000 Tracks |
| 6th Gen 80Gb | MB029 / MB147 / PB029 | 128Gb / ~50000 Tracks |
| 6th Gen 160Gb | MB145 / MB150 / PB145 / PB150 | 128Gb / ~50000 Tracks / *Requires Ribbon* |
| 6.5th Gen 120Gb | MB565 / PB565 / MB562 / PB562 | 128Gb / ~50000 Tracks |
| 7th Gen 160Gb | PC297 / MC297 / PC293 / MC293 | ~50000 Tracks |
Source: https://www.iflash.xyz/store/iflash-compatibility/
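The 128 GB ceiling in the table follows from the address width of LBA28. As a rough check, assuming the usual 512-byte sector size (the function name below is only for illustration), 28 address bits give:

```shell
# LBA28 addresses sectors with 28 bits; 512 bytes per sector is an assumption here.
lba28_limit_gib() {
  sector_bytes=512
  max_sectors=$(( 1 << 28 ))
  echo $(( max_sectors * sector_bytes / 1024 / 1024 / 1024 ))
}

lba28_limit_gib   # prints 128
```

So any capacity beyond 2^28 sectors simply cannot be addressed by firmware that only speaks LBA28, no matter how large the SSD is.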
iPod 5th Generation (Video) Hard Drive Replacement
The iPod comes with a 580 mAh or 850 mAh battery, depending on whether it has the thin or the thick back cover.
After replacing the HDD with an SSD, there is more space for the battery. You can fit a larger battery to get longer battery life. Here is the guide: https://www.ifixit.com/Guide/iPod+Classic+Battery+Replacement/561.
Some sellers even offer a 3000 mAh battery. I’m not sure whether your iPod has enough space for it, so please ask the seller before buying. Some people say the 3000 mAh battery is fake and is actually an 1800 mAh battery. Source: 3rd party extended Battery guide
iPod 5th Generation (Video) Hard Drive Replacement
- Topic: iPod classic 6th Gen LBA28 how to exceed 128gb limit?
- SELLTOONE 128GB SSD for iPod Classic 6th 7th iPod Video 5Gen 5.5th Replace HS081HA MK8010GAH MK8022GAA MK1634GAL MK1231GAL ZIF CE Solid State Drive (128GB)
- iPod 6/6.5 gen iFlash and Rockbox limitations?
Three years ago, I bought a QNAP TS-453Dmini NAS. Although it has a slow web UI and restarts slowly, it still fits my needs, as all of the applications I need run in Docker.
Recently, I wanted to move some files from my Mac to the NAS to save space. I needed an application that behaves like Dropbox: it shows all the files on the NAS and only downloads the ones I need. I tried QSync, but it has no thumbnails for cloud images and no icons to show the file status. I also tried Seafile. It’s a powerful application, but it requires 4 GB of RAM to run and there is a bug in its thumbnails. I used to have a Synology ARM NAS, and Synology Drive has all the features I need, so I wanted to run it on my QNAP NAS. After some research, I managed to run Synology and QNAP together on my NAS. Here is the guide.
To run Synology on QNAP, the first step is to install PVE. I put the PVE ISO on a Ventoy USB drive and rebooted the NAS. When you hear the beep, press the DEL key immediately to enter the BIOS. Change the boot order to boot from the Ventoy drive and then install PVE on the NAS. Remember to move the PVE disk to HDD bay 3 or 4, since the QNAP BIOS won’t set bay 1 or 2 as the boot disk.
It’s pretty easy to install Synology on PVE, and there are many guides on the Internet. Thanks to the RR project, you can install Synology on any x86 hardware.
I have one SSD and one HDD. The SSD is used for PVE and Synology, but I don’t have another disk to move the data to Synology, so I still need to run QNAP on PVE. Although QNAP is not as popular as Synology, there is still a way to run it on PVE. I just followed this guide to install QNAP on PVE. To pass through the HDD, run ls -l /dev/disk/by-id/ to list all devices, then qm set {qnap vm id} -sata0 /dev/disk/by-id/{hdd_id} to pass the HDD through to the QNAP VM. When the QNAP VM restarts, it asks whether to load the OS from the HDD, reset the OS (but keep the data) or initialize the OS.
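Because a wrong disk ID in the pass-through command is easy to miss, it can help to print the final command before running it on the PVE host. The helper below is a hypothetical dry-run sketch: the name passthrough_cmd, the VM ID 101 and the disk ID are placeholders, and the -sata0 slot simply mirrors the command above.

```shell
# Hypothetical helper: print (not run) the qm command for a given VM and disk.
# The -sata0 slot mirrors the command in the text; adjust it if that slot is taken.
passthrough_cmd() {
  vmid="$1"
  disk_id="$2"
  printf 'qm set %s -sata0 /dev/disk/by-id/%s\n' "$vmid" "$disk_id"
}

# Placeholder values, not real identifiers.
passthrough_cmd 101 ata-EXAMPLE_DISK_SERIAL
```

Once the printed line looks right, it can be copied and run as root on the PVE host.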
However, when I chose to load the OS, it rebooted and came back to the same screen. I guess it might be a compatibility issue. Resetting the OS may work. In my case, I created a new disk for the QNAP VM and installed the QNAP OS on it, then passed the HDD through to the QNAP VM. Then, in the Storage and Snapshot app, under Storage > Disks/VJBOD, I clicked the More button in the right corner and chose Recover. I managed to recover the data on the HDD. After that, I could use QNAP as usual.
After using Synology for a while, I found the Synology OS more user-friendly than the QNAP OS. The UI is more responsive and the applications are more polished. There are more community applications available for Synology, and the overall experience feels more integrated. The Synology OS is also better optimized. For example, the Synology backend services use much less CPU and RAM than QNAP’s. I will migrate all of my data to Synology once I get a new HDD.
I bought a Kindle Paperwhite 1 in 2013, when I was still in university. I like it very much and I’ve read many programming books on it. It still works fine after 10 years, but I wanted to try a new model with a larger screen and a faster refresh speed. So I recently bought a used Kindle Paperwhite 5 Signature Edition for only 620 Yuan (about $90). Here is my review.
The Kindle Paperwhite 5 has a 6.8-inch screen at 300 PPI; both the screen size and the PPI are higher than on the previous model. The Kindle Paperwhite 6 has a larger 7-inch screen and double the RAM (1 GB) of the Paperwhite 5 (512 MB). You can buy a 16 GB model for 850 Yuan (about $120). I think the overall experience of the KPW5 is good enough for reading books, so I chose it.
The Signature Edition has 32 GB of storage, wireless charging and an auto-adjusting light sensor. I haven’t used the wireless charging. I don’t think it’s necessary, as the battery life is so long that I hardly need to charge it. The auto-adjusting light is not noticeable either. As for storage, I think 8 GB is big enough if you don’t read manga.
The biggest difference between the KPW5 and the KPW1 is the response speed. The KPW5 is much faster when opening books or tapping UI buttons. The page turning speed is also significantly improved.
Overall, I’m satisfied with the new model. If you want to buy a Kindle, I recommend the KPW5 or KPW6, as the larger screen is worth it.
If you want to read PDF on Kindle, KOReader is a must-have app. You can follow the Kindle Modding guide to jailbreak your Kindle and install KOReader.
There are some plugins I recommend:
- Kindle FileBrowser: You can transfer books through wifi without using USB cable.
- KOL: Open KOReader from main UI.
- Cover Setter: Set a better cover for KUAL/KOL/HOTFIX
Besides the must have Calibre, there are two tools I recommend for Kindle users:
- kaf-cli: A CLI tool to convert txt to epub, with an O’RLY-style cover.
- Kindle Comic Converter: A tool to convert manga to mobi, which can be read on Kindle with full screen.



















































