Logistic regression

Logistic regression is a supervised machine learning model used for classification tasks. It works by learning a hyperplane, defined by coefficients (or weights) θ = [θ0,…,θd], that splits the data into two subsets according to their labels. The coefficients θ consist of a weight vector w = [w1,…,wd] plus a bias term b. For simplicity, we will consider a binary …
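The idea of learning w and b so that a hyperplane separates two classes can be illustrated with a minimal sketch. It assumes plain batch gradient descent on the log loss; the function names, learning rate, and epoch count are illustrative choices, not taken from the post.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights w and bias b by batch gradient descent (hypothetical helper)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * d
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for one sample: sigmoid(w.x + b) - label.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Step against the average gradient of the log loss.
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    # The hyperplane w.x + b = 0 is the decision boundary.
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else 0
```

Calling `fit_logistic` on a small, linearly separable set and then `predict` on each point should reproduce the labels; on real data, feature scaling and a stopping criterion would normally be added.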

PCA: Principal Component Analysis

PCA (Principal Component Analysis) is an unsupervised machine learning algorithm used to reduce the dimensionality of data. It was first introduced by Karl Pearson (1901) and later developed independently by Harold Hotelling (1933). Dimensionality reduction maps the original high-dimensional data onto a lower-dimensional space, thus reducing the risk of model overfitting and improving the generalization ability of the model …
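As a rough illustration of mapping data onto a lower-dimensional space, the sketch below finds only the first principal component and projects the data onto it. It uses power iteration on the sample covariance matrix rather than a full eigendecomposition, which is a simplification chosen here to keep the example dependency-free; the function names are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (mean, unit direction) of the top principal component."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly apply cov and renormalise.
    # Assumes the start vector is not orthogonal to the top eigenvector.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return mean, v


def project(rows, mean, v):
    # One coordinate per row: the centered row's component along v.
    return [sum((r[j] - mean[j]) * v[j] for j in range(len(v))) for r in rows]
```

For points lying on a straight line, the returned direction is parallel to that line and the one-dimensional projection preserves their order along it.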

Log analysis for anomaly detection

Anomaly detection plays an important role in the management of modern large-scale distributed systems. Logs, which record system runtime information and errors, are widely used for anomaly detection. Traditionally, operators have had to go through the logs manually with keyword searching and rule matching. The increasing scale and complexity of modern systems, however, has made the volume of logs explode, which makes manual inspection infeasible. To …

Integer Partition

A partition of a positive integer n is a multiset of positive integers whose sum is equal to n. We denote the number of partitions of n by p(n). What is the number of ways to partition n into no more than k positive integers? Difficulty: Very hard. Input: two integers, n and k. Output: the number of ways, modulo 10^9+7, to split n into no more than k positive integers. Example: given n=5 …
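A straightforward way to count these partitions is sketched below. It relies on the standard fact that partitions of n into at most k parts are in bijection (by conjugation) with partitions of n into parts of size at most k, and then counts the latter with a coin-change style dynamic program. This runs in O(n·k) time, which may well be too slow for the actual limits of a "very hard" problem; the limits are not shown in the excerpt, so treat this as a reference solution for small inputs. The function name is hypothetical.

```python
MOD = 10**9 + 7


def partitions_at_most_k_parts(n, k):
    """Count partitions of n into at most k parts, modulo 10^9+7."""
    # ways[t] = number of partitions of t using the part sizes seen so far.
    ways = [0] * (n + 1)
    ways[0] = 1
    # Parts larger than n can never be used, so cap at min(k, n).
    for part in range(1, min(k, n) + 1):
        for total in range(part, n + 1):
            ways[total] = (ways[total] + ways[total - part]) % MOD
    return ways[n]
```

For instance, `partitions_at_most_k_parts(6, 3)` returns 7, matching the partitions 6, 5+1, 4+2, 3+3, 4+1+1, 3+2+1, and 2+2+2.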

Lattice Paths

Going from the point (x1,y1) to the point (x2,y2) on the Cartesian plane takes at least |x1−x2|+|y1−y2| steps when each step moves one unit along the x-axis or the y-axis (taxicab geometry). How many ways are there to go from (x1,y1) to (x2,y2) using the fewest steps, without crossing or touching the line y=x? Difficulty: Medium. Input: four integers, representing x1, y1, x2, y2, respectively. Output: the number of ways, modulo 10^9+7, to go from (x1,y1) to (x2,y2) using the fewest steps, …
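A brute-force reference for small coordinates can be sketched with dynamic programming over the bounding rectangle. Because every step has unit length along an axis, a path can only meet the line y=x at a lattice point, so it is enough to forbid points with x == y. The sketch assumes that an endpoint lying on the line counts as touching (the full statement is truncated, so this is a guess), and the function name is hypothetical. A closed form via the reflection principle would be needed for large inputs.

```python
MOD = 10**9 + 7


def count_paths(x1, y1, x2, y2):
    """Count shortest lattice paths that never visit a point with x == y."""
    if x1 == y1 or x2 == y2:
        return 0  # assumption: an endpoint on the line counts as touching
    # Shortest paths only ever move toward the target on each axis.
    sx = 1 if x2 >= x1 else -1
    sy = 1 if y2 >= y1 else -1
    w, h = abs(x2 - x1), abs(y2 - y1)
    # ways[i][j] = valid paths to the point i steps along x and j along y.
    ways = [[0] * (h + 1) for _ in range(w + 1)]
    for i in range(w + 1):
        for j in range(h + 1):
            if x1 + sx * i == y1 + sy * j:
                continue  # point on the line y = x: forbidden, stays 0
            if i == 0 and j == 0:
                ways[i][j] = 1
                continue
            total = 0
            if i > 0:
                total += ways[i - 1][j]
            if j > 0:
                total += ways[i][j - 1]
            ways[i][j] = total % MOD
    return ways[w][h]
```

For example, `count_paths(1, 0, 3, 2)` returns 2: only the paths that stay strictly below the diagonal survive.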

Visualization and analysis of web engine data

Data preprocessing and visualization are important skills for managing Internet services such as search engines and online social networks. In this post, we are going to deal with two weeks of search logs from a large search engine and learn some practical techniques to analyze the data. Background and data: Once you submit a query to a search engine, the search engine will log some related …

The imitation game

Alan Turing, a well-known mathematician and computer scientist, was long preoccupied with finding an answer to the question “Can machines think?”. I think this question hasn’t been answered yet, and not because of a lack of technology: machines are built by human engineers, so we understand perfectly how they work. What we don’t fully understand is how our brain works, that gray …