In this post, I will give an introduction to sketching, a statistical technique for handling large datasets. First, I will give the intuitive idea behind sketching, which is also the most important and valuable part of this post. Then, I will describe the various sketching algorithms in detail. Finally, I will give a non-exhaustive list of theoretical results concerning the soundness of sketching. Since this post is an introduction, I will build my presentation around Ordinary Least Squares, which is the first topic in every machine learning course and arguably the most popular technique.
Not all tunnels are born equal; some can be your tomb. Not all vaccines are born equal; some can revive the epidemic. Not all software is born equal; some can cause air crashes. In this post, I will show you how a meticulous programmer can uncover a truth that even Richard Sutton has missed.
As a chronic sufferer of impostor syndrome, I have always known what it is and how it forms. Nevertheless, I have never been ashamed of it; in fact, it pushes me forward and helps me surpass myself again and again. To some extent, I am grateful for it and even proud of it. Although I am a master of impostor syndrome and proactively use it as a weapon, I never found a proper mathematical model to describe it until I recently came across Simpson’s paradox once again. In this post, I will explain both concepts and make the link between them.
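For readers who have not met Simpson’s paradox before, here is a minimal sketch of what it looks like in numbers. It uses the success counts widely quoted in textbook presentations of the kidney-stone treatment example; the variable and function names are my own choices for illustration. Treatment A has the higher success rate in each subgroup, yet treatment B has the higher rate once the subgroups are pooled.

```python
from fractions import Fraction

# (successes, patients) per treatment and stone size, as commonly quoted
# in textbook presentations of the kidney-stone example.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, total):
    """Exact success rate as a fraction."""
    return Fraction(successes, total)


def overall_rate(groups):
    """Success rate after pooling all subgroups of one treatment."""
    successes = sum(s for s, _ in groups.values())
    total = sum(n for _, n in groups.values())
    return Fraction(successes, total)


# A is better within every subgroup...
a_wins_each_group = all(
    rate(*counts["A"][g]) > rate(*counts["B"][g]) for g in ("small", "large")
)
# ...but B looks better when the subgroups are pooled.
b_wins_overall = overall_rate(counts["B"]) > overall_rate(counts["A"])

print(a_wins_each_group, b_wins_overall)
```

The reversal comes from the unequal group sizes: treatment A was mostly given to the harder cases (large stones), which drags its pooled rate down.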
Recently, my girlfriend wanted to know whether I love her more than she loves me. For this purpose, she asked me to rate my love on a scale of 1 to 10. To escape from her interrogation, I returned the same question to her. In the end, each of us wanted to know how deep the other’s love is, but neither wanted to disclose their own secret. As a smart solution expert, I proposed to treat this dilemma as Yao’s Millionaires’ Problem.
A package is a group of reusable modules organized in one folder or a hierarchy of folders. Although modules themselves are already reusable without being bundled in a package, a package structure allows the code to be published and used by other programmers. This blog post addresses some advanced package development issues that do not arise in module development.
Nearly two years ago, I developed a time-series cross-validation package, namely tscv, which has since been widely adopted by scientists and quantitative traders worldwide. Seeing ~1000 monthly downloads, I am delighted to have made some positive contribution to this world. Meanwhile, in the last two years, a lot has happened to our world as well as to me. Although I never for a second forgot my responsibility towards my users, I was, unfortunately, unable to maintain this package. As a consequence, as you may have noticed, this package has been incompatible with scikit-learn version 0.24 for the past two months. To respond to this issue, I have decided to restore the compatibility and enhance tscv, and this post bears witness to my resolution.
The papers accepted at NeurIPS 2020 have been announced. This year we have 1899 accepted papers. I have compiled the metadata of all these papers, based on which I can see the laureates of this year’s conference. To determine the laureates, for both individuals and organizations, I used the following four criteria: author contribution index, first author index, organization influence index, and organization sustainability index.
When I started my PhD, I knew nothing about academia and thus spent a lot of time and effort uncovering its unspoken rules. I wish that someone could have lent me a hand, rather than leaving me wandering in the darkness. This painstaking experience has inspired me to help those younger than me so that they can have smoother sailing in their intellectual journeys. In this post, I will try something similar but more profound.
Recently, a reader asked me whether my time series cross-validation package TSCV can be used for nested cross-validation. I mulled it over and found that the answer is yes. I planned to tell him this good news, but the answer quickly became lengthy. Therefore, I decided to turn the answer into a standalone post. In the following, I will explain the concept of nested cross-validation and its advantages, as well as how to use TSCV or any similar package for it. The same content is also hosted on GitHub. If you have any questions, you can ask in either place (preferably in both places).
All men are equal, but not all matrices have inverses. For instance, rectangular matrices do not have inverses; square matrices without full rank do not have inverses. The matrix rights activists among mathematicians (i.e. E. H. Moore, 1920; Arne Bjerhammar, 1951; and Roger Penrose, 1955) thus stood up and spoke for these computationally unfavored matrices. Thanks to their continual efforts, every matrix finally got an inverse, dubbed the Moore-Penrose (pseudo) inverse. These previously unfavored matrices have since contributed to academia and revolutionized statistics and machine learning. To mark its 100th anniversary, let me talk, in this post, about the Moore-Penrose inverse and its applications.
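To make the idea concrete, here is a minimal sketch for one special case only: a tall matrix with two linearly independent columns, where the Moore-Penrose inverse reduces to (AᵀA)⁻¹Aᵀ. The helper names (`transpose`, `matmul`, `inverse_2x2`, `pinv_two_columns`) are hypothetical and written just for this illustration; in practice one would rely on a numerical library rather than this hand-rolled version.

```python
from fractions import Fraction


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse_2x2(m):
    """Invert a 2x2 matrix with the closed-form adjugate formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def pinv_two_columns(a):
    """Pseudo-inverse of an n x 2 matrix with independent columns.

    Assumption: full column rank, so that A+ = (A^T A)^{-1} A^T.
    """
    at = transpose(a)
    return matmul(inverse_2x2(matmul(at, a)), at)


# A 3x2 rectangular matrix: it has no ordinary inverse.
A = [[Fraction(1), Fraction(0)],
     [Fraction(0), Fraction(1)],
     [Fraction(1), Fraction(1)]]
A_pinv = pinv_two_columns(A)

# A+ acts as a left inverse here, and A A+ A gives back A.
print(matmul(A_pinv, A))
print(matmul(matmul(A, A_pinv), A) == A)
```

Exact fractions are used instead of floats so that the identities can be checked with plain equality.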
In August, I got interested in Amazon Web Services (AWS) and spent some time getting an AWS Cloud Practitioner certificate. To put into practice what I had learned during the training, I asked myself: why not develop a web application? Thus, I decided to create a Plotly Dash dashboard and deploy it on AWS. The service that I chose is AWS Elastic Beanstalk. You can find several guides on the Internet, written by amateurs, that teach you how to deploy Dash on AWS. However, there is something lacking in all these guides. Therefore, I, also an amateur, decided to write a guide myself. In the following, I will show you how to achieve this “feat” step by step. To understand this guide, you need to know how to develop a Dash application and what AWS Elastic Beanstalk is.
Earlier this month (July 2019), mathematician Hao Huang posted a proof of the Sensitivity Conjecture, which had troubled mathematicians for 30 years. To people’s surprise, this proof is only two pages long and involves only undergraduate-level math. On the Internet, you can find some reports, written for the general public, about the background story and the interpretation of the Sensitivity Conjecture. Also, several experts, such as Terence Tao, are elaborating on it. Here, writing for students and non-experts, I will summarize the key steps in Hao Huang’s proof, in an attempt to help them quickly grasp the essentials.
This guide documents one coding style for static, class, and abstract methods in Python. Code that follows this style runs on both Python 2.X and Python 3.X.
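As a rough illustration of what such a style can look like, here is a minimal sketch; it is one possible convention under my own assumptions, not necessarily the one the guide settles on. The class names `Shape` and `Square` are hypothetical. The only non-obvious part is building the abstract base class by calling `ABCMeta` directly, which avoids the metaclass syntax that differs between Python 2 and Python 3.

```python
from abc import ABCMeta, abstractmethod

# Calling ABCMeta directly creates an abstract base class with syntax that is
# valid in both Python 2 and Python 3 (assumption: no third-party helper is used).
ABC = ABCMeta("ABC", (object,), {})


class Shape(ABC):
    """Hypothetical abstract base class used only for illustration."""

    @abstractmethod
    def area(self):
        """Subclasses must override this method."""

    @classmethod
    def unit(cls):
        # A class method receives the class itself, so subclasses get
        # an instance of their own type.
        return cls(1)

    @staticmethod
    def describe(value):
        # A static method receives neither the instance nor the class.
        return "area = {0}".format(value)


class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side ** 2


square = Square.unit()
print(Shape.describe(square.area()))
```

Trying to instantiate `Shape` directly raises a `TypeError`, because its abstract method has not been overridden.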
Many talk about data science and machine learning with enthusiasm, but few know about one of the most important building blocks behind them: convex optimization. Indeed, nowadays nearly every data science problem is first transformed into an optimization problem and then solved by standard methods. Convex optimization, albeit basic, is the most important concept in optimization and the starting point of all understanding. If you are an aspiring data scientist, convex optimization is an unavoidable subject that you had better learn sooner rather than later.
Many newcomers to the Julia language feel confused about the value type described in the official documentation. They don’t understand what it is used for and why other languages don’t have this feature. In fact, as a “secret” rarely shared by the core developers, you will probably never need the value type.
In this post, I will discuss one of the two best papers at ICML 2018, Delayed Impact of Fair Machine Learning. Unlike other papers, which construct various innovative definitions of fairness, this paper analyzes the delayed impact of fairness policies. It shows that these policies do not necessarily improve the situation of the disadvantaged population: in some cases, they may hurt it in the long run.