Accessible Machine Learning: Insights for Everyone
Machine learning is like an affair in high school: everyone is talking about it, only a few people know what to do, and only your teacher is actually doing it. If you’ve ever tried to read machine learning articles on the Internet, you’ve probably come across two types: thick academic trilogies full of theorems (I couldn’t even get through half of one) and fishy fairytales about artificial intelligence, data-science magic, and the jobs of the future.
I decided to write the post I had wished existed for a long time: an easy introduction for everyone who has always wanted to understand machine learning. Only real-world problems, practical solutions, straightforward language, and no high-level theorems. Once and for all. Whether you’re a developer or a manager.
Let’s roll.
Classical Machine Learning
The earliest approaches emerged from pure statistics in the 1950s. They performed formal math tasks such as finding patterns in numbers, measuring how close data points are to each other, and calculating the directions of vectors.
Today, half of the Internet runs on these algorithms. When you see a list of articles to “read next”, or your bank blocks your card at a random gas station in the middle of nowhere, it’s usually the work of one of these simple algorithms.
Neural networks are quite popular among the big technology companies. Obviously: a 2% accuracy increase can mean an additional $2 billion in revenue. But when you’re small, that doesn’t make sense. Some teams have spent a year building a recommendation system for their e-commerce website, only to learn that search engines drove 99% of their traffic. Their algorithms were useless; most visitors never even opened the main page.
Despite their popularity, classical approaches are so natural that you could easily explain them to a toddler. They are like basic arithmetic: we use it every day without even thinking.
Supervised Learning
Classical machine learning is classified into two categories: supervised and unsupervised.
In the first example, the machine has a “supervisor” or a “teacher” who provides all of the answers, such as whether the picture depicts a cat or a dog. The teacher has previously separated (classified) the data into cat and dog categories, and the computer is learning from these instances.
One by one. Dog by cat.
In unsupervised learning, the machine gets a pile of animal photos and has to figure out who’s who on its own. The data is unlabeled, there is no teacher, and the machine tries to find patterns by itself. We will discuss these methods below.
Machines learn quicker with a tutor, making them more useful for real-world jobs. There are two sorts of tasks: classification (predicting an object’s category) and regression (predicting a point on a numeric axis).
Classification
“Splits objects based on an attribute known beforehand. Sort socks by color, documents by language, music by genre.”
Currently used for:
- Spam filtering
- Language detection
- Search for similar documents
- Sentiment analysis
- Recognition of handwritten letters and numbers
- Fraud detection
Popular algorithms include Naive Bayes, Decision Tree, Logistic Regression, K-Nearest Neighbors, and Support Vector Machines.
From here on, feel free to add more examples of tasks in the comments. Everything here is based on my own personal experience.
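If you want to poke at classification yourself, here is a minimal sketch, assuming scikit-learn is installed; the tiny spam/ham dataset is invented purely for illustration.

```python
# Minimal spam-vs-ham classifier sketch (assumes scikit-learn is installed).
# The tiny dataset below is made up purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "win a free prize now", "cheap meds online",   # spam
    "meeting moved to 3pm", "lunch tomorrow?",     # ham
]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words features feeding a Naive Bayes classifier: the classic spam-filter combo.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["free prize meds", "see you at the meeting"]))
# -> ['spam' 'ham'] on this toy data
```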
Regression
Regression is just classification in which we forecast a number rather than a category. Examples include car prices based on mileage, traffic by time of day, demand volume based on company growth, and so on.
Regression is ideal when anything is dependent on time.
Everyone who works in finance and analytics likes regression. It’s even built into Excel. And its internals are quite simple: the machine just tries to draw a line that represents the average correlation. Unlike a person with a pen and a whiteboard, the machine does it with mathematical precision, minimizing the average distance to every dot.
When the line is straight, it is a linear regression; when it curves, it is a polynomial. There are two main forms of regression. The other options are more exotic. Logistic regression is the black sheep of the herd.
Don’t be fooled; this is a classification approach, not regression.
It’s OK to mix up regression and classification, though. Many classifiers turn into regressors with a little tuning: we can not only determine the object’s class, but also remember how close it is to that class. Here comes the regression.
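Here is the line-drawing in code: a minimal least-squares sketch with NumPy. The mileage and price numbers are made up for illustration.

```python
# Linear regression sketch with plain NumPy: fit a straight line to toy
# "price vs. mileage" data via least squares. The numbers are invented.
import numpy as np

mileage = np.array([10_000, 40_000, 80_000, 120_000, 160_000], dtype=float)
price   = np.array([18_000, 15_500, 12_000,   9_500,   7_000], dtype=float)

# Design matrix with a bias column: price ~ w * mileage + b
X = np.column_stack([mileage, np.ones_like(mileage)])
w, b = np.linalg.lstsq(X, price, rcond=None)[0]

print(f"price ~ {w:.4f} * mileage + {b:.0f}")
print("predicted price at 100k km:", w * 100_000 + b)
```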
Unsupervised learning
Unsupervised learning was invented a bit later, in the 1990s. It is used less often, but sometimes we simply have no choice.
Labeled data is a luxury. But what if I want to build a bus classifier? Should I personally photograph and label the millions of fucking buses on the streets? That would take a lifetime, and I still have a lot of games on my Steam account that I haven’t played.
There’s some hope for capitalism in this scenario. Thanks to social stratification, there are millions of cheap workers available through services like Amazon Mechanical Turk who will complete your task for as little as $0.05. And that’s how things usually get done here.
Alternatively, you can use unsupervised learning. Although, to be honest, I can’t think of a great practical application for it as a primary algorithm; it’s mostly useful for exploratory data analysis. A skilled person with an Oxford degree feeds garbage into the machine and watches what happens. Are there any clusters? No. Any visible relationships? No. Well, carry on. You wanted to work in data science, right?
Clustering
“Divides objects based on unknown features. Machine chooses the best way”
Nowadays used:
- For market segmentation (types of customers, loyalty)
- To merge close points on a map
- For image compression
- To analyze and label new data
- To detect abnormal behavior
Popular algorithms: K-means, Mean-Shift, DBSCAN
Clustering is classification without predefined classes. It’s like sorting socks by color when you don’t remember all the colors you own. The clustering algorithm tries to find similar (by certain attributes) objects and merge them into a cluster; objects with lots of similar features are joined into one group. Some algorithms even let you specify the exact number of clusters you want.
Web map markers serve as a great illustration of clustering. When you search for all vegan eateries nearby, the clustering engine sorts them into blobs with a number. Otherwise, your browser would freeze while attempting to display all three million vegan eateries in that trendy downtown.
Apple Photos and Google Photos employ more sophisticated grouping. They’re looking for faces in images to make albums about your buddies. The software does not know how many friends you have or how they appear, but it is trying to identify common face traits. Typical clustering.
Another common issue is picture compression. When exporting the image as PNG, you may set the palette to, say, 32 colors. It implies clustering will discover all the “reddish” pixels, determine the “average red,” and apply it to all the red pixels. Fewer colors equals a smaller file size and greater profit!
However, you may have trouble with colors like cyan: is it green or blue? This is where the K-Means algorithm comes in.
It randomly drops 32 color dots onto the palette; these are the centroids. Every remaining point is assigned to its nearest centroid, so galaxies form around the 32 colors. Then we move each centroid to the center of its galaxy and repeat until the centroids stop moving.
All done. Clusters are defined and stable, with exactly 32 of them.
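For the curious, here is a bare-bones sketch of that exact loop in NumPy; toy 2-D points stand in for the 32-color palette, and the cluster count is an arbitrary choice.

```python
# Bare-bones K-Means, mirroring the loop above: drop random centroids, assign
# points to the nearest one, move centroids to the cluster mean, repeat.
# Just a sketch: no handling of empty clusters or multiple restarts.
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # distance from every point to every centroid
        dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):   # centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids

points = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
labels, centroids = kmeans(points, k=3)
print(centroids)
```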
It is convenient to search for centroids. However, in actual life clusters are not usually circular. Imagine you are a geologist. And you must locate some comparable minerals on the map. In that instance, the clusters may be irregularly formed and even nested. Also, you have no idea how many of them to expect. 10? 100?
K-means is not applicable here, although DBSCAN can be useful. Let’s imagine our dots represent individuals in the town square. Find three persons standing close together and invite them to hold hands. Then instruct them to start grasping the hands of neighbors they can reach. And so on, until no one else can grasp anyone’s hand. That is our first cluster. Repeat until everyone is grouped. Done.
A wonderful bonus: a person with no one to shake hands with is an outlier.
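That hand-holding procedure is roughly what DBSCAN does. Here is a short scikit-learn sketch on invented points; DBSCAN labels the loners as -1.

```python
# The hand-holding procedure above is roughly what DBSCAN does.
# Sketch with scikit-learn on toy points; the label -1 marks the loners (outliers).
import numpy as np
from sklearn.cluster import DBSCAN

crowd = np.random.randn(100, 2)        # people standing close together
loner = np.array([[10.0, 10.0]])       # someone with no hands to shake
points = np.vstack([crowd, loner])

labels = DBSCAN(eps=0.8, min_samples=3).fit_predict(points)
print(set(labels))      # cluster ids, plus -1 for outliers
print(labels[-1])       # the loner -> -1
```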
Just like classification, clustering can be used to detect anomalies. Does a user behave strangely right after signing up? Let the machine ban him temporarily and create a ticket for support to check. Maybe it’s a bot. We don’t even need to define “normal behavior”: we simply upload all user actions to the model and let the machine decide whether they are “typical” or not.
This approach doesn’t work as well as classification, but it never hurts to try.
Dimensionality Reduction (Generalization)
“Assembles specific features into more high-level ones”
Today is used for:
- Recommender systems (★)
- Beautiful visualizations
- Topic modeling and similar document search
- Fake image analysis
- Risk management
Popular algorithms: Principal Component Analysis (PCA), Singular Value Decomposition (SVD), Latent Dirichlet allocation (LDA), Latent Semantic Analysis (LSA, pLSA, GLSA), t-SNE (for visualization)
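To make “assembling features into higher-level ones” concrete, here is a minimal PCA sketch, assuming scikit-learn; the five-feature dataset is generated on the spot.

```python
# Dimensionality reduction sketch: squeeze a toy 5-feature dataset down to
# 2 "high-level" components with PCA (assumes scikit-learn).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                          # 200 samples, 5 made-up features
X[:, 3] = X[:, 0] * 2 + 0.1 * rng.normal(size=200)     # one feature is largely redundant

pca = PCA(n_components=2)
X_small = pca.fit_transform(X)

print(X_small.shape)                    # (200, 2)
print(pca.explained_variance_ratio_)    # how much information each component keeps
```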
Association Rule Learning
This covers methods for analyzing shopping carts, automating marketing strategy, and other event-related tasks. When you have a sequence of something and want to find patterns in it, these are the tools.
Assume a consumer walks up to the checkout counter with a six-pack of beers.
Should we place peanuts along the way to the register? How often are they bought together? Beer and peanuts probably do go together, but what other sequences can we predict? Can a small change in the arrangement of goods lead to a significant increase in profits?
The same goes for e-commerce. What will be the customer’s next purchase?
I’m not sure why rule learning is the least discussed category of machine learning. Classical methods are based on a head-on look through all the purchased goods using trees or sets. Algorithms can only search for patterns, but cannot generalize them to new examples.
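As a toy illustration of that head-on look (not any retailer’s real system), here is plain Python counting which item pairs show up together across a handful of made-up baskets; real rule-learning algorithms such as Apriori or FP-Growth do this far more cleverly.

```python
# Toy version of "what is bought together": count item pairs across baskets.
from collections import Counter
from itertools import combinations

baskets = [
    {"beer", "peanuts", "chips"},
    {"beer", "peanuts"},
    {"bread", "milk"},
    {"beer", "chips"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# "support" of a pair = the share of baskets in which it appears together
for pair, count in pair_counts.most_common(3):
    print(pair, count / len(baskets))
```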
In the real world, major retailers create their own proprietary solutions, so there will be no revolutions for you. The most advanced technology here is recommender systems. However, I may not be aware of a breakthrough in the field. Please share your thoughts in the comments below.
Reinforcement Learning
Finally, we see what appears to be actual artificial intelligence. In several papers, reinforcement learning is classified as halfway between supervised and unsupervised learning. They share nothing in common! Is this due to the name?
Reinforcement learning is employed when your challenge has nothing to do with data but requires you to live in a certain environment. Like a video game environment or a metropolis for self-driving vehicles.
Even knowing all the traffic rules won’t teach an autopilot how to drive safely. Even with extensive data collection, we cannot anticipate every possible situation. The goal is to minimize error, not to predict every move.
Surviving in an environment is a fundamental concept of reinforcement learning.
Introduce the robot to real-life situations, punishing mistakes and rewarding good behavior. The same way we teach our children, correct?
A more effective approach is to build a simulated city where self-driving cars can learn their skills first. This is how we train autopilots right now: create a virtual city based on a genuine map, fill it with pedestrians, and train the car to kill as few of them as possible. After gaining confidence in the artificial GTA, the robot moves on to real-world testing. Fun!
There are two approaches: model-based and model-free.
Model-based implies the automobile must learn a map or its components.
This technique is outmoded, as self-driving cars cannot memorize the entire earth.
Model-Free learning involves the automobile acting sensibly and generalizing events to maximize reward, rather than memorizing individual movements.
Remember how AI beat a top Go player?
The game has more possibilities than there are atoms in the universe, as previously demonstrated.
The machine couldn’t memorize all the combinations and thereby win Go the way it did in chess. Instead, it picked the best move for each situation, and it did that well enough to outplay a human opponent.
This approach is the core concept behind Q-learning and its derivatives (SARSA and DQN). The ‘Q’ stands for “Quality”: the robot learns to perform the most “qualitative” action in each situation, and all the situations are memorized as a simple Markovian process.
A computer may try billions of scenarios in a simulated environment and track which solutions resulted in higher rewards.
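Here is a minimal Q-learning sketch on an invented five-cell corridor (not AlphaGo, obviously): a table of state-action values, updated from rewards, exactly the “track which solutions earned more reward” idea.

```python
# The heart of Q-learning: a table of "how good is action a in state s",
# updated from rewards. Sketch on a made-up 5-cell corridor where the goal
# (reward 1) sits at the right end.
import random

n_states, actions = 5, [-1, +1]          # move left / move right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, eps = 0.5, 0.9, 0.1

for episode in range(500):
    s = 0
    while s != n_states - 1:
        # epsilon-greedy: mostly take the best known action, sometimes explore
        a = random.choice(actions) if random.random() < eps else max(actions, key=lambda a: Q[(s, a)])
        s_next = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: nudge Q towards reward + discounted best future value
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
        s = s_next

print({s: max(actions, key=lambda a: Q[(s, a)]) for s in range(n_states - 1)})
# -> the learned policy: move right (+1) from every state
```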
How does it discriminate between familiar circumstances and new ones? When a self-driving car is at a road crossing and the traffic signal turns green, does that indicate it may proceed? What if an ambulance rushes through surrounding streets?
Today, the answer is “no one knows”. There is no easy answer.
Researchers are continually hunting for it, but have only found workarounds. Some would manually code all circumstances to tackle uncommon occurrences, such as the trolley dilemma. Some prefer to rely on neural networks to solve problems. Q-learning evolved into Deep Q-Network (DQN). However, they are not a silver bullet.
Reinforcement Learning mimics the appearance of artificial intelligence to the typical person. It’s impressive that a machine can make real-life judgments! This topic is rapidly developing and integrating with neural networks to improve floor cleaning accuracy. Advancements in technology are very amazing!
Off-topic. Genetic algorithms were very popular while I was a student (link includes a beautiful graphic). This is about putting a group of robots in a single setting and letting them attempt to attain the goal till they die. Then we select the best, cross them, modify certain genes, and restart the simulation. After a few billion years, we will have an intelligent being. Probably. Evolution at its finest.
Genetic algorithms are classified as reinforcement learning, and their most crucial trait, as demonstrated by decades of experience, is that no one cares about them.
Humanity has yet to discover a task where these strategies outperform others. These tools are ideal for student experimentation and generating excitement among university supervisors about “artificial intelligence” without requiring much effort. YouTube would appreciate it as well.
Ensemble Method
“Bunch of stupid trees learning to correct errors of each other”
Today is used for:
- Everything that works with classical algorithms, only better
- Search systems (★)
- Computer vision
- Object detection
Popular algorithms: Random Forest, Gradient Boosting
It’s time for modern, grown-up methods. Ensembles and neural networks are the two main fighters paving our path towards the singularity.
These methods are commonly employed in production due to their high accuracy.
Today, neural networks have received a lot of attention, whereas terms like “boosting” and “bagging” are not commonly used on TechCrunch.
Despite their efficiency, the concept behind them is extremely basic. Combining inefficient algorithms and forcing them to fix each other’s faults might result in a system with greater overall quality than the best individual algorithms.
To achieve the best results, use the most unstable algorithms, the ones that can predict wildly different outcomes from small noise in the input data. Think regression and decision trees: these algorithms are so sensitive to outliers in the input data that models can go crazy.
This is exactly what we need.
We can form an ensemble using whatever algorithm we are familiar with. To improve accuracy, combine classifiers with regression and assess the results. In my experience, avoid using Bayes or kNN in this situation. They are quite stable, despite being “dumb”. That is dull and predictable. Like your ex.
Instead, there are three battle-tested approaches for assembling ensembles.
Stacking: the output of several parallel models is passed as input to the last one, which makes the final decision. Like a girl who asks her friends whether to meet with you before making the final decision herself.
The emphasis here is on the word “different”. Mixing the same algorithms on the same data makes no sense. The choice of algorithms is completely up to you; for the final decision-making model, however, regression is usually a good choice.
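A possible sketch with scikit-learn’s StackingClassifier: a decision tree and an SVM in parallel, with a logistic regression on top making the final call. The toy data is generated on the spot.

```python
# Stacking sketch: several different models vote, and a final logistic regression
# makes the call from their outputs (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)   # invented toy data

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("svm", SVC())],
    final_estimator=LogisticRegression(),   # the decision-making model on top
)
stack.fit(X, y)
print(stack.score(X, y))
```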
In practice, stacking is less popular because the other two approaches usually give better accuracy.
Bagging, also known as Bootstrap AGGregatING: train the same algorithm on different random subsets of the original data, then average the answers.
Data in random subsets can repeat: subsets of a set like “1-2-3” could be “2-2-3”, “1-2-2”, “3-1-2”, and so on. We use these new datasets to teach the same algorithm several times and predict the final answer with simple majority voting.
The Random Forest method is a well-known example of bagging, which involves bagging decision trees as seen above. Random Forest is responsible for painting boxes around people’s faces in the camera app on mobile devices. While neural networks are too sluggish for real-time use, bagging is a perfect solution since it can construct trees across several video card shaders or ML processors.
For certain applications, the Random Forest’s parallelism is more significant than a minor loss of accuracy during boosting. Particularly in real-time processing. There is always a trade-off.
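In practice, bagging usually just means reaching for a Random Forest; here is a near one-liner sketch with scikit-learn on generated toy data.

```python
# Bagging in one line of scikit-learn: a Random Forest, i.e. many decision
# trees trained on random subsets of the data, answering by majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)  # trees train in parallel
forest.fit(X, y)
print(forest.score(X, y))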
Boosting algorithms are taught successively. The future iterations focus on data points that were incorrectly anticipated by the prior iteration. Repeat until you are satisfied.
Similar to bagging, we employ subsets of our data, but they are not randomly generated. In each subsample, we select a portion of the data that the preceding algorithm failed to handle. We create a new algorithm that learns from prior blunders.
Pros: extremely high, even illegal-in-some-countries, classification accuracy that all the cool kids will envy. Cons: no parallelism, as already mentioned. Even so, it is still faster than neural networks. It’s like a race between a dump truck and a racing car: the truck can carry more, but if you want speed, take the car.
To see a real-world example of boosting, simply search on Facebook or Google. Can you hear trees roaring and crashing together to order results by relevance? This is because they employ boosting.
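And a matching sketch for boosting, again with scikit-learn and generated data: trees are built one after another, each new one focusing on what the previous ones got wrong.

```python
# Boosting sketch: trees built sequentially, each correcting the errors of the
# ones before it (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

boost = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1)
boost.fit(X_train, y_train)
print(boost.score(X_test, y_test))
```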
Neural Networks and Deep Learning
If you’ve never been taught about neural networks using “human brain” analogies, you’re in luck. Tell me your secret. Let me explain it my way.
A neural network consists of neurons and connections between them. A neuron is a function with several inputs and one output. The job is to process all incoming numbers, apply a function, and output the results.
Here’s an example of a real-life neuron: add all input integers and return 1 if the sum exceeds N. Otherwise, zero.
Connections function similarly to channels between neurons. Neurons communicate by connecting their outputs and inputs to convey digits. Each link has only one parameter: weight. It functions similarly to a signal’s connection strength. A 0.5-weighted link converts the number 10 to 5.
Weights direct neurons to prioritize some inputs over others. Weights are modified throughout training to help the network learn. Essentially, that is all there is to it.
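That threshold neuron, written out literally; the weights and the threshold N here are arbitrary numbers for illustration.

```python
# The neuron from the example above: weigh the inputs, sum them,
# and fire (output 1) if the sum passes a threshold N.
def neuron(inputs, weights, threshold):
    total = sum(x * w for x, w in zip(inputs, weights))   # weighted sum over connections
    return 1 if total > threshold else 0

# A 0.5-weighted connection turns the number 10 into 5, as described above.
print(neuron([10, 4], weights=[0.5, 1.0], threshold=8))   # 5 + 4 = 9 > 8  -> 1
print(neuron([10, 4], weights=[0.5, 1.0], threshold=10))  # 9 <= 10        -> 0
```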
To prevent anarchy, neurons are connected through layers rather than randomly. Neurons inside a layer are not linked, however they do connect to neurons in the next and preceding layers. The network’s data flow is unidirectional, from the first layer’s inputs to the last layer’s output.
With enough layers and the right weights, we can get something like this: show the network an image of the handwritten digit 4, the black pixels activate the associated neurons, they activate the next layers, and so on, until the output neuron responsible for the four lights up. The result is achieved.
A network in which every neuron is connected to all the neurons in the next layer is called a multilayer perceptron (MLP), and it is the most basic architecture for beginners. I haven’t seen it used for solving production tasks, though.
After constructing a network, our task is to assign proper weights so that the neurons react correctly to incoming signals. Remember that our data consists of both ‘inputs’ and ‘outputs’. We show the network a drawing of the digit 4 and tell it to adjust its weights so that it outputs 4 whenever it sees this input.
To begin with, all weights are allocated randomly. When we show it a digit, it generates a random answer because the weights are not yet accurate. We then compare how much the result varies from the correct one.
After traversing the network backwards from outputs to inputs, we inform each neuron that their activation was unsuccessful, and to focus on other connections instead.
After hundreds of thousands of cycles of ‘infer-check-punish’, there is optimism that the weights will be fixed and function properly.
The scientific name for this is backpropagation, the process of propagating the error backwards. The funny thing is that it took twenty years to come up with this method; before that, neural networks were trained however people could manage.
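Here is the “infer-check-punish” cycle in miniature: a tiny two-layer network learning XOR with plain NumPy. This is only a sketch; the layer size and learning rate are arbitrary choices.

```python
# "Infer-check-punish" in miniature: a two-layer network learning XOR.
# Weights start random, the error is pushed back through the network,
# and every weight gets nudged a little on each cycle.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

W1, b1 = rng.normal(size=(2, 8)), np.zeros((1, 8))
W2, b2 = rng.normal(size=(8, 1)), np.zeros((1, 1))
sigmoid = lambda z: 1 / (1 + np.exp(-z))
lr = 1.0

for step in range(20_000):
    # forward pass (infer): inputs -> hidden layer -> output
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass (check): how wrong were we, and which weights are to blame?
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # punish: nudge every weight in the direction that reduces the error
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(2).ravel())   # usually ends up close to [0, 1, 1, 0]
```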
A well-trained neural network may mimic any of the techniques discussed in this chapter, sometimes with greater accuracy. This universality is what has made them so popular.
“Finally, we have the architecture of the human brain!” people said. “We just need to assemble lots of layers and train them on any possible data.” Then the first AI winter began, then it thawed, and then another wave of disappointment struck.
Networks with many layers required computational power that was unimaginable at the time. Nowadays any gaming PC with a GeForce card outperforms the data centers of that era. Back then people had no hope of acquiring that kind of computational power, and neural networks were a huge disappointment.
Deep learning rose to prominence about ten years ago.
A machine learning timeline illustrates the cycle of optimism and pessimism.
In 2012, convolutional neural networks won the ImageNet competition, reigniting interest in deep learning approaches from the 1990s. We now have video cards!
Deep learning differs from traditional neural networks by introducing novel training methods for larger networks.
Nowadays, mostly theoreticians argue about what counts as deep and what as shallow. Practitioners simply grab a popular ‘deep’ framework such as Keras, TensorFlow, or PyTorch, even to build mini-networks with five layers, because it is just better than all the previous tools. We simply call them neural networks.
I’ll tell you about the two main kinds used today.
Convolutional Neural Networks (CNN)
Convolutional neural networks are really popular right now. They can search for objects in photos and videos, recognize faces, transfer styles, enhance images, add effects like slow motion, and improve image quality. Today, CNNs are used in pretty much every case that involves pictures and videos. Even your iPhone runs several of these networks over your photos to detect objects in them. If there’s anything to detect, heh.
Facebook has open-sourced Detectron, its object detection framework.
Extracting features from images has always been the hard part. You can split text into sentences and look up word attributes in specialized dictionaries, but images had to be labeled manually to teach the machine where a cat’s ears or tail were. This approach was called ‘handcrafting features’, and almost everyone used it.
There are several problems with handcrafting.
First, if a cat has its ears down or is turned away from the camera, you’re in trouble: the neural network won’t see a thing.
Second, try naming ten distinguishing features off the top of your head that separate cats from other animals. I couldn’t do it, but if a black blob rushed past me at night, even just in the corner of my eye, I’d know what it was.
That’s because people take into account lots of traits beyond ear shape and leg count, traits they don’t even think about, and therefore can’t explain them to the machine.
This means the machine has to learn such features on its own, building on top of basic lines. We will do the following: first, divide the whole image into 8×8 pixel blocks and assign each block a type of dominant line: horizontal [-], vertical [|], or one of the diagonals [/]. Several may be prominent at once; that happens, and we aren’t always absolutely confident.
The output consists of basic tables of sticks that represent the edges of items in the picture. They are images on their own, but constructed from sticks. Let’s see how an 8×8 block matches up again. And again…
The technique is named for the action it performs, convolution. Convolution may be viewed as a layer in a neural network, with each neuron capable of performing several functions.
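Stripped of the neural-network wrapping, a convolution is just this: slide a small filter over the image and record how strongly each patch matches it. A NumPy sketch with a made-up 8×8 image and a vertical-edge filter:

```python
# A convolution in its raw form: slide a tiny filter across the image and
# record how strongly each patch matches it. This filter responds to vertical edges.
import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.zeros((8, 8))
image[:, 4:] = 1.0                       # left half dark, right half bright
vertical_edge = np.array([[-1.0, 1.0],   # lights up where brightness jumps left to right
                          [-1.0, 1.0]])

print(convolve2d(image, vertical_edge))  # a column of 2s marks the vertical edge
```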
Our neural network automatically assigns higher weights to the combinations of sticks that appear most often in images of cats. It doesn’t matter whether it’s a simple straight line like a cat’s back or a geometrically complex object like a cat’s face; something will light up strongly.
As the output, we place a simple perceptron that looks at the most activated combinations and uses them to tell cats from dogs.
This concept utilizes a neural network to automatically identify the most distinguishing aspects of items. We don’t need to choose them manually. By searching billions of photos, our net can construct feature maps and differentiate objects independently.
Recurrent Neural Networks (RNN)
The second most popular architectural style nowadays. Recurrent networks have enabled practical applications such as neural machine translation, speech recognition, and voice synthesis in smart assistants.
RNNs perform better with sequential data such as voice, text, or music.
Do you remember Microsoft Sam, the voice synthesizer in Windows XP? The humorous guy constructs words letter by letter, attempting to glue them together. Consider using Google Assistant or Amazon Alexa. They not only pronounce words well, but also use proper accents.
Modern voice assistants are taught to say whole sentences, rather than individual letters. By training a neural network on spoken texts, we may construct an audio sequence that closely resembles the original speech.
In other words, we enter text and produce sounds. We use a neural network to produce audio for a given text, compare it to the original, fix mistakes, and aim for optimal results.
Sounds like a traditional learning method. A perceptron is appropriate for this. But how should we specify the outputs? It’s not feasible to generate a single output for each potential sentence.
Text, voice, and music are all examples of sequences. They are made up of sequential units, much like syllables. Although each sounds unique, it is influenced by the previous ones. If you lose this connection, you will experience dubstep.
We can train the perceptron to make new sounds, but how does it remember prior answers? The goal is to add memory to each neuron and utilize it as an extra input during the following run. When a neuron detects a vowel, it may make a mental note that the next sound should be higher (this is a simplified example).
That is how recurrent networks appeared.
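In code, the whole trick is one extra argument: the neuron’s previous output is fed back in alongside the new input. A minimal NumPy sketch with made-up sizes:

```python
# A recurrent neuron in miniature: its output this step depends both on the new
# input and on its own previous output (the "memory" fed back in).
import numpy as np

def rnn_step(x, h_prev, Wx, Wh):
    # new hidden state = squash(input contribution + memory contribution)
    return np.tanh(Wx @ x + Wh @ h_prev)

rng = np.random.default_rng(0)
Wx, Wh = rng.normal(size=(3, 2)), rng.normal(size=(3, 3))   # made-up sizes
h = np.zeros(3)

sequence = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
for x in sequence:
    h = rnn_step(x, h, Wx, Wh)
    print(h)   # the same input would give a different output later in the sequence
```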
This strategy had a major drawback: when all neurons recalled their previous outcomes, the network’s number of connections became too large to alter all weights.
And when a neural network can’t forget, it can’t learn anything new (people have the same flaw).
The first solution was simple: limit the neuron’s memory, say, to no more than five recent outputs. However, that undermined the entire concept.
Later, a more effective technique was to employ specific cells comparable to computer memory. Each cell may store, read, and reset a number.
They were called Long Short-Term Memory (LSTM) cells.
When a neuron needs to set a reminder, it puts a flag in the cell, something like “it was a consonant in the word, next time use different pronunciation rules”. When the flag is no longer needed, the cells are reset, keeping only the “long-term” connections of the classical perceptron. In other words, the network is trained not only to learn weights but also to set reminders.
It’s simple, but it works!
You may collect voice samples from anyplace. BuzzFeed, for example, created a neural network to mimic Obama’s speech delivery. As you can see, audio synthesis is already a straightforward process. The video still has difficulties, but it’s a matter of time.
There are several network designs in the wild. I propose the article “Neural Network Zoo,” which provides a basic overview of several types of neural networks.
The End: when is the war with the machines coming?
The notion of “when will machines become smarter than us and enslave everyone?” is fundamentally flawed.
There are just too many hidden conditions in it.
The phrase “become smarter than us” implies a single scale of intelligence. A person is at the top, followed by dogs, and pigeons at the bottom.
That is incorrect.
Contrary to popular belief, humans do not always outperform animals. Squirrels can recall thousands of hiding spots containing nuts, whereas humans struggle to locate their keys.
Is intelligence a collection of abilities rather than a measured value? Is knowing where nuts are stashed not considered intelligence?
Why do we assume the human brain has limited capabilities? Many popular graphs on the Internet show technological progress as an exponential curve while human potential stays constant. But does it?
Okay, mentally multiply 1680 by 950. I know you won’t try, lazy bastards. With a calculator, you can do the task in only two seconds. Does this imply that the calculator has enhanced your cognitive abilities?
If so, can I continue to extend them with other machines? For example, might I use my phone’s notes to avoid forgetting important information? I think I’m doing it right now. I’m leveraging machines to enhance my cognitive talents.
Think about it. Thank you for reading.