Many ways of using TensorFlow, PyTorch, and Keras

With TensorFlow, PyTorch, and Keras currently the predominant deep learning frameworks, mastering them is every data scientist’s mission. Yet their complexity and richness allow many different ways of doing the same task, which can cause confusion. As a signpost, this article is a rundown of the various ways of leveraging these frameworks to perform the same machine learning task. In particular, I will explain how to use each framework to solve Ordinary Least Squares at a low, middle, and high level.
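
For a taste of the high-level end of that spectrum, here is a minimal sketch of OLS as a single Dense layer in Keras; the synthetic data and optimizer settings below are illustrative placeholders, not the ones used in the article.

    import numpy as np
    from tensorflow import keras

    # Synthetic data: y = X @ w_true + noise
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))
    w_true = np.array([1.5, -2.0, 0.7])
    y = X @ w_true + 0.1 * rng.normal(size=1000)

    # OLS as a single linear layer trained to minimize mean squared error
    model = keras.Sequential([
        keras.Input(shape=(3,)),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.05), loss="mse")
    model.fit(X, y, epochs=100, verbose=0)

    print(model.get_weights()[0].ravel())  # approaches w_true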

Read More

Demystifying KerasTuner

Hyperparameter tuning is a pivotal concept in machine learning, deep or not. For deep learning, KerasTuner is a powerful and easy-to-use package that can tune any Keras model in almost any way you like. Yet the intricate machinery concealed beneath its API can make users uneasy. This post will demystify its internal implementation and help you understand how KerasTuner accomplishes this feat. I hope that after reading it, you will feel confident using KerasTuner, be it for tuning a neural network or any other custom model. If you are an aspiring, hard-working data scientist, this post is a sure read.
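
For readers who have never touched KerasTuner, here is a minimal usage sketch to anchor the discussion; the toy data, search space, and trial budget are placeholders rather than anything taken from the post.

    import keras_tuner
    import numpy as np
    from tensorflow import keras

    def build_model(hp):
        # The search space: the number of units and the learning rate are tunable.
        model = keras.Sequential([
            keras.Input(shape=(10,)),
            keras.layers.Dense(hp.Int("units", min_value=32, max_value=128, step=32),
                               activation="relu"),
            keras.layers.Dense(1),
        ])
        model.compile(optimizer=keras.optimizers.Adam(hp.Choice("lr", [1e-2, 1e-3])),
                      loss="mse")
        return model

    # Toy regression data, only to make the sketch runnable.
    rng = np.random.default_rng(0)
    x, y = rng.normal(size=(500, 10)), rng.normal(size=500)

    tuner = keras_tuner.RandomSearch(build_model, objective="val_loss", max_trials=3)
    tuner.search(x, y, validation_split=0.2, epochs=5, verbose=0)
    best_hp = tuner.get_best_hyperparameters(1)[0]
    print(best_hp.values)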

Read More

An Overview of the Reptile Family

If you are into scientific computing with Python, you are likely familiar with Anaconda and its conda package manager. You might also be aware of miniconda, conda-forge, and mamba, but be unsure of how they differ and which is best suited for your needs. This post aims to provide clarity and help you navigate the different package managers available.

Read More

An Introduction to Randomized Sketching

In this post, I will give an introductory presentation of sketching, a statistical technique for handling large datasets. First, I will give the intuitive idea behind sketching, which is also the most important and valuable part of this post. Then, I will describe various sketching algorithms in detail. Finally, I will give a non-exhaustive list of theoretical results concerning the soundness of sketching. Since this post is an introduction, I will build the presentation around Ordinary Least Squares, the first topic in every machine learning course and arguably the most popular technique.
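
To make the idea concrete up front, here is a minimal NumPy sketch of sketched OLS with a plain Gaussian sketching matrix; the dimensions are arbitrary, and faster sketches exist, but the principle is the same: compress many equations into a few random combinations and solve the small problem instead.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, m = 20_000, 10, 500            # tall data, modest sketch size

    A = rng.normal(size=(n, d))
    x_true = rng.normal(size=d)
    b = A @ x_true + 0.1 * rng.normal(size=n)

    # Exact OLS on the full data
    x_full, *_ = np.linalg.lstsq(A, b, rcond=None)

    # Sketched OLS: compress the n equations into m random combinations
    S = rng.normal(size=(m, n)) / np.sqrt(m)
    x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)

    print(np.linalg.norm(x_full - x_sketch))  # small: the sketch approximately preserves the solution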

Read More

Simpson's Paradox and Impostor Syndrome

As a chronic sufferer of impostor syndrome, I know all too well what it is and how it forms. Nevertheless, I am never ashamed of it; in fact, it pushes me forward and helps me surpass myself again and again. To some extent, I am grateful for it and even feel proud of it. Although I consider myself a master of impostor syndrome and proactively use it as a weapon, I had never found a proper mathematical model to describe it until I recently came across Simpson’s paradox once again. In this post, I will explain both concepts and make the link between them.
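
For readers unfamiliar with Simpson’s paradox, here is a tiny numeric illustration (the figures are purely illustrative): one method can win inside every subgroup and still lose on the aggregate, simply because the subgroup sizes are lopsided.

    # Simpson's paradox in four numbers per method:
    # A wins within each subgroup, yet B wins overall.
    results = {
        "A": {"easy": (81, 87),   "hard": (192, 263)},   # (successes, attempts)
        "B": {"easy": (234, 270), "hard": (55, 80)},
    }

    for method, groups in results.items():
        rates = {g: s / n for g, (s, n) in groups.items()}
        total_s = sum(s for s, _ in groups.values())
        total_n = sum(n for _, n in groups.values())
        print(method, {g: f"{r:.0%}" for g, r in rates.items()},
              f"overall {total_s / total_n:.0%}")
    # A {'easy': '93%', 'hard': '73%'} overall 78%
    # B {'easy': '87%', 'hard': '69%'} overall 83%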

Read More

She wants to know how much I love her: a gentle tutorial on Yao's Millionaires' Problem

Recently my girlfriend wanted to know whether I love her more than she loves me. For this purpose, she asked me to rate my love on a scale of 1 to 10. To escape her interrogation, I returned the same question to her. It ended with both of us wanting to know how deep the other’s love is, while neither of us wanted to disclose our own secret. As a smart solutions expert, I proposed treating this dilemma as Yao’s Millionaires’ Problem.

Read More

Advanced knowledge about Python (>=3.3) package structure and import model

A package is a group of reusable modules organized in a folder or a hierarchy of folders. Although modules are already reusable without being bundled into a package, a package structure allows the code to be published and used by other programmers. This blog post addresses some advanced package development issues that do not arise in module development.
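
As a small, self-contained taste of the topic, the runnable toy below builds a throwaway package on disk and then imports it, showing how the folder structure and relative imports fit together; all module names are made up for illustration.

    import pathlib
    import sys
    import tempfile
    import textwrap

    # Lay out a minimal package:  mypkg/{__init__.py, core.py, utils/{__init__.py, io.py}}
    root = pathlib.Path(tempfile.mkdtemp())
    pkg = root / "mypkg"
    (pkg / "utils").mkdir(parents=True)

    (pkg / "__init__.py").write_text("from .core import run\n")     # re-export the public API
    (pkg / "core.py").write_text(textwrap.dedent("""
        from .utils.io import load_data      # relative import of a subpackage

        def run(path):
            return load_data(path)
    """))
    (pkg / "utils" / "__init__.py").write_text("")
    (pkg / "utils" / "io.py").write_text("def load_data(path):\n    return f'loaded {path}'\n")

    # Make the parent folder importable, then use the package as a user would.
    sys.path.insert(0, str(root))
    import mypkg
    print(mypkg.run("data.csv"))   # -> 'loaded data.csv'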

Read More

Future of my TSCV package -- a letter to my users

Nearly two years ago, I developed a time-series cross-validation package, namely TSCV, which has since been widely adopted by scientists and quantitative traders worldwide. Seeing roughly 1000 monthly downloads, I am delighted to have made some positive contribution to this world. Meanwhile, over the last two years, a lot has happened to our world as well as to me. Although I never for a second forgot my responsibility towards my users, I was, unfortunately, unable to maintain the package. As a consequence, as you may have noticed, it has been incompatible with scikit-learn 0.24 for the past two months. To address this issue, I have decided to restore compatibility and enhance TSCV, and this post will witness my resolution.

Read More

Laureates of NeurIPS 2020

The accepted papers for NeurIPS 2020 have been announced: this year there are 1899 of them. I have compiled the metadata of all these papers, from which I can identify the laureates of this year’s conference. To determine the laureates, for both individuals and organizations, I used the following four criteria: author contribution index, first author index, organization influence index, and organization sustainability index.
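
The precise definitions are given in the full post; as a rough sketch of the flavor of the computation, the snippet below assumes a hypothetical author contribution index that splits one unit of credit equally among a paper’s authors, applied to toy metadata.

    from collections import Counter

    # Toy metadata: one record per accepted paper.
    papers = [
        {"authors": ["Alice", "Bob"], "orgs": ["Univ X", "Lab Y"]},
        {"authors": ["Alice", "Carol", "Dan"], "orgs": ["Univ X"]},
    ]

    # Hypothetical "author contribution index": each paper contributes one unit of
    # credit, split equally among its authors (the post's exact definition may differ).
    contribution = Counter()
    for paper in papers:
        for author in paper["authors"]:
            contribution[author] += 1 / len(paper["authors"])

    print(contribution.most_common(3))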

Read More

Research organization patterns, research process patterns, and my preferences

When I started my PhD, I knew nothing about academia and thus spent a lot of time and effort mining its unspoken rules. I wish someone had lent me a hand rather than leaving me wandering in the dark. This painstaking experience has inspired me to help younger researchers so that they can have smoother sailing in their intellectual journeys. In this post, I will try something similar but more profound.

Read More

Use my package TSCV for nested cross-validation

Recently, a reader asked me whether my time series cross-validation package TSCV can be used for nested cross-validation. I mulled it over and found the answer to be yes. I planned to simply tell him the good news, but my answer quickly grew lengthy, so I decided to turn it into a standalone post. In the following, I will explain the concept of nested cross-validation and its advantages, as well as how to use TSCV (or any similar package) for it. The same content is also hosted on GitHub. If you have any questions, you can ask in either place (preferably in both).
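
As a preview, here is a rough sketch of nested cross-validation with TSCV’s GapKFold used as the splitter for both loops (following the usage shown in the TSCV README); the estimator, parameter grid, gap sizes, and data are placeholders.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from tscv import GapKFold

    # Toy time-series-like data; real data would be ordered in time.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 5))
    y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=300)

    # Inner loop: hyperparameter selection; outer loop: unbiased performance estimate.
    inner_cv = GapKFold(n_splits=5, gap_before=2, gap_after=2)
    outer_cv = GapKFold(n_splits=5, gap_before=2, gap_after=2)

    search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=inner_cv)
    scores = cross_val_score(search, X, y, cv=outer_cv)
    print(scores.mean(), scores.std())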

Read More

The 100th anniversary of the Moore-Penrose inverse and its role in statistics and machine learning

All men are equal, but not all matrices have inverses. For instance, rectangular matrices do not have inverses, and neither do square matrices without full rank. The matrix rights activists among mathematicians (namely E. H. Moore, 1920; Arne Bjerhammar, 1951; and Roger Penrose, 1955) thus stood up and spoke for these computationally unfavored matrices. Thanks to their continual efforts, every matrix finally got an inverse, dubbed the Moore-Penrose (pseudo) inverse. These previously unfavored matrices have since contributed to academia and revolutionized statistics and machine learning. To mark its 100th anniversary, let me talk, in this post, about the Moore-Penrose inverse and its applications.
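
For reference, the Moore-Penrose inverse A^+ of a matrix A is the unique matrix satisfying the four Penrose conditions, and its signature application in statistics is least squares: A^+ b is the minimum-norm least-squares solution.

    A A^{+} A = A, \qquad
    A^{+} A A^{+} = A^{+}, \qquad
    (A A^{+})^{*} = A A^{+}, \qquad
    (A^{+} A)^{*} = A^{+} A

    \hat{x} = A^{+} b \;\in\; \operatorname*{arg\,min}_{x} \|A x - b\|_{2},
    \quad \text{with the smallest } \|x\|_{2} \text{ among all minimizers.}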

Read More

Yet another guide to deploy Plotly Dash on AWS Elastic Beanstalk

In August, I got interested in Amazon Web Services (AWS) and spent some time earning an AWS Cloud Practitioner certificate. To put into practice what I had learned during the training, I asked myself: why not develop a web application? Thus, I decided to create a Plotly Dash dashboard and deploy it on AWS, and the service I chose is AWS Elastic Beanstalk. You can find several guides on the Internet, written by amateurs, that teach you how to deploy Dash on AWS; however, something is lacking in all of them. Therefore, I, also an amateur, decided to write a guide myself. In the following, I will show you how to achieve this “feat” step by step. To follow this guide, you should already know how to develop a Dash application and what AWS Elastic Beanstalk is.
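
To set expectations, the heart of the deployment is a single file; below is a minimal sketch of what it can look like. The naming matters: by default, the Elastic Beanstalk Python platform looks for a module named application.py exposing a WSGI callable named application, and Dash exposes its underlying Flask server as app.server. The layout and the packaging details (requirements.txt, the zip bundle) are covered in the guide itself.

    # application.py -- a minimal Dash app in the shape Elastic Beanstalk expects
    from dash import Dash, html

    app = Dash(__name__)
    app.layout = html.Div("Hello from Elastic Beanstalk")

    # Dash wraps a Flask server; Elastic Beanstalk serves this WSGI object.
    application = app.server

    if __name__ == "__main__":
        # Local development only; on Elastic Beanstalk a WSGI server runs it.
        app.run_server(debug=True)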

Read More

A walk-through of Hao Huang's solution to the sensitivity conjecture

Earlier this month (July 2019), mathematician Hao Huang posted a proof of the Sensitivity Conjecture, which had troubled mathematicians for 30 years. To everyone’s surprise, the proof is only two pages long and involves only undergraduate-level math. On the Internet, you can find reports, written for the general public, about the background story and the interpretation of the sensitivity conjecture, and several experts, such as Terence Tao, have elaborated on it. Here, writing for students and non-experts, I will summarize the key steps in Hao Huang’s proof, in an attempt to help them quickly grasp the essentials.
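
For orientation, the combinatorial statement at the heart of the proof is the following; the rest is the known reduction from Boolean functions to the hypercube.

    \textbf{Theorem (Huang, 2019).}\ \text{Every induced subgraph of the hypercube } Q_n
    \text{ on } 2^{n-1}+1 \text{ vertices has maximum degree at least } \sqrt{n}.

    \textbf{Corollary (via Gotsman--Linial).}\ \text{For every Boolean function } f,
    \quad s(f) \ge \sqrt{\deg(f)}.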

Read More

Important inequalities in convex optimization: proofs and intuition

Many talk about data science and machine learning with enthusiasm, but few know about one of the most important building blocks behind them: convex optimization. Indeed, nowadays nearly every data science problem is first transformed into an optimization problem and then solved by standard methods. Convex optimization, albeit basic, is the most important concept in optimization and the starting point of all understanding. If you are an aspiring data scientist, convex optimization is an unavoidable subject that you had better learn sooner rather than later.
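
To give a flavor of the subject, the first-order characterization of convexity is a representative example of such an inequality: a differentiable convex function lies above all of its tangent planes.

    f(y) \;\ge\; f(x) + \nabla f(x)^{\top} (y - x)
    \qquad \text{for all } x, y \in \operatorname{dom} f.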

Read More