Essential Data Analytics Interview Questions: Ace Your Next Interview
What is the difference between Data Mining and Data Analysis?
Data Mining
- Used to identify patterns in stored data.
- Mining is performed on clean and well documented data.
- Results extracted from data mining are not easy to interpret.
Data Analysis
- Used to order & organize raw data in a meaningful manner.
- Data cleaning is part of the data analysis process, because raw data is rarely recorded in a clean, accurate form.
- Results extracted from data analysis are easy to interpret.
To recap, Data Mining is a technique for identifying patterns within recorded data. Analysts employ algorithms to identify patterns, which is commonly used in Machine Learning. Data analysis involves cleaning and organizing raw data to get insights.
What is the process of Data Analysis?
Data analysis involves gathering, cleaning, analyzing, converting, and modeling data to provide insights and reports for commercial profitability.
Refer to the graphic below to see the various phases in the procedure.
- Collect Data: Data is gathered from numerous sources and stored so that it can be cleaned and processed. In this phase, missing values and outliers are handled.
- Analyze Data: Once the data is prepared, the next step is to analyze it. A model is run repeatedly and improved. The model is then validated to ensure that it fulfills the necessary business criteria.
- Create Reports: Finally, the model is implemented, and the generated reports are distributed to the stakeholders.
What is the difference between Data Mining and Data Profiling?
Data Mining is the process of analyzing data to identify previously unknown relationships. It primarily focuses on the detection of anomalous records, dependencies, and cluster analysis.
Data Profiling is the process of examining specific aspects of data. It primarily focuses on giving useful information about data properties such as data type, frequency, etc.
What is data cleansing and what are the best ways to practice data cleansing?
Data cleansing (also called data cleaning or data wrangling) is the process of discovering and correcting faults in order to improve the quality of data. Refer to the figure below to learn about the various approaches to dealing with missing data.
What are the important steps in the data validation process?
As the name suggests, Data Validation is the process of verifying data. This stage involves two primary processes. These are data screening and verification.
- Data Screening: Various algorithms are employed in this stage to screen the full data and identify any incorrect numbers.
- Data Verification: Each suspicious value is reviewed against several use-cases before a final judgment is made on whether it should be included in the data or not.
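The screening step can be sketched in plain Python. This is only an illustration: the z-score rule and its cutoff of 2 are assumptions for the example, not a prescribed method, and the data is invented.

```python
from statistics import mean, stdev

def screen_outliers(values, z_threshold=2.0):
    """Data screening: flag values whose z-score exceeds an illustrative cutoff."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) / sigma > z_threshold]

data = [10, 12, 11, 13, 12, 11, 98]  # 98 looks like an incorrect entry
suspects = screen_outliers(data)
print(suspects)  # [98]
```

Each flagged value would then go through the verification step, where an analyst reviews it against the relevant use-cases before deciding whether to keep it.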
What do you think are the criteria to say whether a developed data model is good or not?
The answer to this question may differ from person to person. However, the following are a few factors that I believe should be evaluated to determine whether a data model is good or not:
- The model constructed for the dataset should exhibit predictable performance. This is essential to forecast the future.
- A model is regarded to be good if it is easily adaptable to modifications based on company needs.
- If the data changes, the model should be able to scale with it.
- The generated model should also be easy for clients to consume in order to provide actionable and lucrative results.
When do you think you should retrain a model? Is it dependent on the data?
Business data changes daily, but its format tends to remain stable. When the business enters a new market, faces new competition, or repositions itself, the model should be retrained. So yes, it depends on the data: whenever company dynamics change, retrain the model so that it reflects changing customer behavior.
Can you mention a few problems that data analysts usually encounter while performing analysis?
The following are some of the most common issues faced during data analysis.
- Duplicate entries and spelling errors diminish data quality.
- Extracting data from a bad source may require significant cleaning effort.
- When data is extracted from multiple sources, its representation may differ. Mixing data from sources that represent the same values differently creates inconsistencies that must be reconciled before analysis.
- Finally, missing data can pose a problem while performing data analysis.
What is the KNN imputation method?
This approach imputes a missing attribute value using the values of the k records that are most similar to the record with the missing attribute.
Distance functions are used to determine how similar two records are.
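A minimal sketch of KNN imputation in plain Python, assuming a small numeric table where `None` marks the missing value (the data and the choice of k=2 are invented for illustration):

```python
import math

def knn_impute(rows, target_idx, k=2):
    """Fill missing values (None) in column target_idx with the mean of that
    column over the k nearest complete rows (Euclidean distance on the
    remaining columns)."""
    complete = [r for r in rows if r[target_idx] is not None]
    filled = []
    for r in rows:
        if r[target_idx] is not None:
            filled.append(list(r))
            continue
        # Distance computed on every attribute except the one being imputed
        def dist(other):
            return math.dist(
                [v for i, v in enumerate(r) if i != target_idx],
                [v for i, v in enumerate(other) if i != target_idx],
            )
        neighbors = sorted(complete, key=dist)[:k]
        estimate = sum(n[target_idx] for n in neighbors) / k
        filled.append([estimate if i == target_idx else v
                       for i, v in enumerate(r)])
    return filled

data = [[1.0, 2.0], [1.1, 2.1], [5.0, 9.0], [1.05, None]]
print(knn_impute(data, target_idx=1))  # last row becomes [1.05, 2.05]
```

In practice a library implementation such as scikit-learn's `KNNImputer` would be used instead of hand-rolled code; the sketch just shows the idea of "borrow values from the most similar records".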
Mention the name of the framework developed by Apache for processing large datasets for an application in a distributed computing environment.
The Hadoop Ecosystem was developed for processing large datasets in a distributed computing environment. The Hadoop Ecosystem consists of the following components.
- HDFS -> Hadoop Distributed File System
- YARN -> Yet Another Resource Negotiator
- MapReduce -> Data processing using programming
- Spark -> In-memory Data Processing
- PIG, HIVE-> Data Processing Services using Query (SQL-like)
- HBase -> NoSQL Database
- Mahout, Spark MLlib -> Machine Learning
- Apache Drill -> SQL on Hadoop
- Zookeeper -> Managing Cluster
- Oozie -> Job Scheduling
- Flume, Sqoop -> Data Ingesting Services
- Solr & Lucene -> Searching & Indexing
- Ambari -> Provision, Monitor and Maintain cluster
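The MapReduce component above can be illustrated with the classic word-count example. This is a plain-Python sketch that mimics the map, shuffle, and reduce phases, not the actual Hadoop API:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate the grouped values, here by summing the counts."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big insights", "big cluster"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 1, 'insights': 1, 'cluster': 1}
```

In real Hadoop, the map and reduce functions run in parallel across the cluster and the framework handles the shuffle; the logic, however, is the same.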
Now we go on to the next set of questions, the Excel Interview Questions.
Data Analyst Interview Questions:
Excel
Microsoft Excel is one of the easiest and most powerful software tools available today. It allows users to perform quantitative and statistical analysis through a simple interface for data manipulation, so its applications span many disciplines and professional needs. It is an essential skill that provides a foundation for becoming a Data Analyst. Let's quickly explore the questions on this topic.
Can you tell what is a waterfall chart and when do we use it?
The waterfall chart displays both the positive and negative values that contribute to the final outcome value. For instance, when examining a company's net income, the chart includes all the cost figures, visualizing how revenue becomes net income after the costs are deducted.
How can you highlight cells with negative values in Excel?
Excel’s conditional formatting allows you to highlight cells with negative values. Here are the steps you can take:
- Select the range of cells you want to check.
- Go to the Home tab and choose Conditional Formatting.
- Navigate to Highlight Cell Rules and select the Less Than option.
- In the Less Than dialog box, set the value to 0.
How can you clear all the formatting without actually removing the cell contents?
You may wish to remove all of the formatting and keep only the plain data. To accomplish this, use the 'Clear Formats' option on the Home tab: clicking the 'Clear' drop-down menu reveals it.
What is a Pivot Table, and what are the different sections of a Pivot Table?
A Pivot Table is a basic tool in Microsoft Excel that lets you easily summarize large datasets. It is quite simple to use, since generating reports only requires dragging and dropping row/column headings.
A pivot table consists of four distinct sections:
- Values Area: Where the values are reported.
- Rows Area: The headings located to the left of the values area.
- Column Area: The headings at the top of the values area.
- Filter Area: An optional filter for drilling down into the dataset.
Can you make a Pivot Table from multiple tables?
Yes. We can build a single Pivot Table from multiple tables, provided there is a relationship between them.
How can we select all blank cells in Excel?
If you want to select all blank cells in Excel, utilize the Go To Special Dialog Box. Here are the steps you may take to select all of the blank cells in Excel.
- Select the complete dataset and press F5. This will open the Go To dialog box.
- Select the ‘Special‘ button to open the Go To special dialog box.
- After that, pick Blanks and click OK.
This selects all of the blank cells in your dataset.
What are the most common questions you should ask a client before creating a dashboard?
The answer to this question varies from case to case. However, here are some frequent questions to ask before constructing a dashboard in Excel:
- What is the purpose of the dashboard?
- Are there multiple data sources, and what are they?
- How will the Excel dashboard be used?
- How frequently should the dashboard be refreshed?
- Which version of Office does the client use?
What is a Print Area and how can you set it in Excel?
In Excel, a Print Area is a collection of cells that are set to print whenever the worksheet is printed. For example, if you just want to print the first 20 rows from the full worksheet, you may choose the first 20 rows as the Print Area.
Now, to configure the Print Area in Excel, follow the steps below:
- Select the cells to set the Print Area.
- Then, select the Page Layout tab.
- Select Print Area.
- Select Set Print Area.
What steps can you take to handle slow Excel workbooks?
There are several strategies for dealing with sluggish Excel workbooks. Here are a few ways you can manage them:
- Try manual calculation mode.
- Keep all referenced data in a single spreadsheet.
- Frequently utilize Excel tables and named ranges.
- Use helper columns rather than array formulae.
- Avoid utilizing complete rows or columns in references.
- Convert all unnecessary formulas into values.
Can you sort multiple columns at one time?
Multiple sorting means sorting by one column and then sorting by another column while keeping the order of the first intact. In Excel, you can sort by several columns at the same time.
To perform multiple sorting, utilize the Sort Dialog Box. To accomplish this, choose the data you wish to sort and then click on the Data Tab. After that, click the Sort icon.
In this dialog box, you specify the sorting for one column, and can then sort by another column by clicking the Add Level button.
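Outside Excel, the same multi-level sort can be sketched in Python with a tuple sort key, where each tuple element plays the role of one level in the Sort dialog (the rows here are invented):

```python
rows = [
    {"region": "East", "sales": 250},
    {"region": "West", "sales": 100},
    {"region": "East", "sales": 100},
    {"region": "West", "sales": 300},
]

# Sort by region first, then by sales within each region,
# mirroring the Add Level button in Excel's Sort dialog
rows.sort(key=lambda r: (r["region"], r["sales"]))
print(rows)
```

Python's sort is stable, so sorting by a composite key gives exactly the "sort, then sort within" behavior the Sort dialog describes.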
Moving on to the following series of questions, which are connected to statistics.
Data Analyst Interview Questions: Statistics
Statistics is a discipline of mathematics that involves data gathering, organization, analysis, interpretation, and presentation. Statistics falls into two categories: Descriptive and Inferential. This discipline is connected to mathematics and provides a strong foundation for a career in data analysis.
What do you understand by the term Normal Distribution?
This is one of the most significant and extensively used distributions in statistics. Normal distributions, often known as the Bell Curve or Gaussian curve, indicate the range of values in terms of mean and standard deviation. Refer to the image below.
The figure above shows that data is often spread evenly around a central value, with no bias to either side. The random variables are distributed in a symmetrical bell-shaped curve.
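As a quick illustration, Python's `statistics.NormalDist` can verify the familiar 68-95-99.7 coverage of a normal distribution:

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
nd = NormalDist(mu=0, sigma=1)

# Probability mass within 1, 2, and 3 standard deviations of the mean
for k in (1, 2, 3):
    coverage = nd.cdf(k) - nd.cdf(-k)
    print(f"within {k} sigma: {coverage:.4f}")
# within 1 sigma: 0.6827
# within 2 sigma: 0.9545
# within 3 sigma: 0.9973
```

The symmetric coverage around the mean is exactly the "no bias to either side" property described above.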
What is A/B Testing?
A/B testing (also called split testing) is a statistical hypothesis test for a randomized experiment with two variants, A and B. It uses sample statistics to estimate population parameters. The test compares two versions of a web page by showing variants A and B to a similar number of visitors; the variant with the higher conversion rate wins.
The purpose of A/B testing is to measure the effect of modifications to a website. For example, consider a banner ad on which you have spent a significant amount of money. A/B testing lets you estimate the return on investment (ROI) by comparing the click-through rates of the banner variants.
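A hedged sketch of how such a comparison might be evaluated, using a two-proportion z-test with invented visitor and conversion counts (a real experiment involves more care with design, sample size, and multiple testing):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)        # pooled conversion rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided p-value
    return z, p_value

# Variant A: 200 conversions out of 1000 visitors; variant B: 260 out of 1000
z, p = two_proportion_z(conv_a=200, n_a=1000, conv_b=260, n_b=1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value (conventionally below 0.05) suggests the difference in conversion rates is unlikely to be due to chance alone.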
What is the statistical power of sensitivity?
Sensitivity (also called recall or the true positive rate) is used to assess how well a classifier detects true events.
The classifier might be Logistic Regression, a Support Vector Machine, or a Random Forest.
Sensitivity is the ratio of correctly predicted true events to the total number of true events: Sensitivity = True Positives / (True Positives + False Negatives).
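As a sketch, sensitivity can be computed directly from confusion-matrix counts (the numbers below are invented):

```python
def sensitivity(true_positives, false_negatives):
    """Sensitivity (recall) = TP / (TP + FN)."""
    return true_positives / (true_positives + false_negatives)

# Hypothetical classifier results: 80 true events caught, 20 missed
print(sensitivity(true_positives=80, false_negatives=20))  # 0.8
```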
What is the Alternative Hypothesis?
To understand the Alternative Hypothesis, first understand the Null Hypothesis. The null hypothesis is the statement under test, put forward for possible rejection on the premise that the observed outcome is due to chance alone.
Following this, the alternative hypothesis is the statement that contradicts the Null Hypothesis: it assumes the observations are the result of a real effect, with some amount of chance variation on top.
What is the difference between univariate, bivariate and multivariate analysis?
The distinctions among univariate, bivariate, and multivariate analyses are as follows:
- Univariate: The analysis of a single variable at a time; a descriptive approach that summarizes one variable's distribution.
- Bivariate: The analysis of two variables at a time, used to determine the relationship between them.
- Multivariate: The analysis of three or more variables, used to investigate the effect of several factors on the responses.
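A small plain-Python illustration of the univariate versus bivariate distinction, with invented data (the Pearson correlation coefficient is a classic bivariate measure):

```python
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]   # e.g. advertising spend (invented)
y = [2, 4, 5, 4, 5]   # e.g. sales (invented)

# Univariate: describe one variable at a time
print(mean(x), round(stdev(x), 3))

# Bivariate: Pearson correlation between two variables
n = len(x)
cov = sum((a - mean(x)) * (b - mean(y)) for a, b in zip(x, y)) / (n - 1)
r = cov / (stdev(x) * stdev(y))
print(round(r, 3))  # 0.775
```

A multivariate analysis would extend this to three or more variables at once, for example via multiple regression.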
Can you tell me what are Eigenvectors and Eigenvalues?
Eigenvectors: They are mostly used to understand linear transformations, and are typically calculated for correlation or covariance matrices.
Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing, or stretching.
Eigenvalue: The eigenvalue is the strength of the transformation, that is, the factor by which it stretches or compresses in the direction of the corresponding eigenvector.
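As a sketch, the dominant eigenvalue and eigenvector of a small matrix can be found by power iteration, done here in plain Python for an illustrative 2x2 stretch matrix:

```python
import math

def power_iteration(matrix, steps=100):
    """Estimate the dominant eigenvalue/eigenvector of a 2x2 matrix by
    repeatedly applying the transformation and normalizing."""
    v = [1.0, 1.0]
    for _ in range(steps):
        w = [matrix[0][0] * v[0] + matrix[0][1] * v[1],
             matrix[1][0] * v[0] + matrix[1][1] * v[1]]
        norm = math.hypot(*w)
        v = [w[0] / norm, w[1] / norm]
    # Rayleigh quotient v.(Av) gives the eigenvalue for the converged direction
    av = [matrix[0][0] * v[0] + matrix[0][1] * v[1],
          matrix[1][0] * v[0] + matrix[1][1] * v[1]]
    eigenvalue = av[0] * v[0] + av[1] * v[1]
    return eigenvalue, v

# A stretch by 3 along x and 1 along y: dominant eigenvalue 3, direction (1, 0)
lam, vec = power_iteration([[3.0, 0.0], [0.0, 1.0]])
print(lam, vec)
```

The converged direction is the eigenvector (the direction the transformation stretches along), and the returned value is its eigenvalue (the stretch factor).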
What is the difference between 1-Sample T-test, and 2-Sample T-test?
To answer this question, first explain what T-tests are.
T-tests are hypothesis tests that compare means. Each test on sample data produces a single value, known as the T-value. For the 1-Sample T-test, the formula is t = (x̄ - μ₀) / (s / √n), where x̄ is the sample mean, μ₀ is the null hypothesis value, s is the sample standard deviation, and n is the sample size.
Because this formula is in ratio format, you may describe it using the signal-to-noise ratio analogy.
The numerator would be a signal, whereas the denominator would be noise.
So, to compute the 1-Sample T-test, subtract the null hypothesis value from the sample mean: the larger this difference, the stronger the signal. If your sample mean is 7 and the null hypothesis value is 2, the signal is 5.
Now, look at the denominator, which is the noise in our example and is a measure of variability known as the standard error of the mean. This reflects your sample’s ability to reliably predict the population or dataset mean.
Noise has an inverse relationship with sample accuracy: the higher the noise, the less reliably the sample mean estimates the population mean.
The 1-Sample T-value is then the signal-to-noise ratio, which shows how well your signal stands out from the noise.
To calculate the 2-Sample T-test, take the ratio of the difference between the two sample means to the standard error of that difference.
To recap, the 1-Sample T-test compares a sample set to a mean, whereas the 2-Sample T-test assesses whether a mean difference between two sample sets is statistically significant for the total population or due to chance.
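The 1-Sample computation can be sketched in Python; the sample below is invented, but chosen so the sample mean is 7 and the null value is 2, matching the signal-of-5 example above:

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, null_mean):
    """1-Sample T-value: signal (mean minus null value) over noise
    (standard error of the mean)."""
    signal = mean(sample) - null_mean
    noise = stdev(sample) / sqrt(len(sample))  # standard error of the mean
    return signal / noise

sample = [5, 6, 7, 8, 9]  # sample mean is 7
print(round(one_sample_t(sample, null_mean=2), 3))  # 7.071
```

Here the signal is 5 and the standard error is about 0.707, so the T-value is about 7.07: the signal stands well clear of the noise.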
What are different types of Hypothesis Testing?
The many forms of hypothesis testing are listed below:
- T-test: Used when the population standard deviation is unknown and the sample size is small.
- Chi-Square Test for Independence: Assesses the association between categorical variables in a population sample.
- Analysis of Variance (ANOVA): Compares mean values across groups. It is used much like a T-test, but for more than two groups.
- Welch's T-test: Determines whether two samples have equal means when their variances cannot be assumed equal.
How to represent a Bayesian Network in the form of Markov Random Fields (MRF)?
Consider the following examples for representing a Bayesian Network as Markov Random Fields:
Consider two variables, A and B, connected by an edge A -> B in a Bayesian network. Its probability distribution factorizes as P(A) * P(B | A). In contrast, if we consider the same network as a Markov Random Field, it is represented by a single potential function over the pair (A, B).
So that was a basic example to begin with. Now, consider a harder situation where one variable is a parent of two others: A is the parent variable, pointing down to B and C. In this scenario, the joint distribution is P(A) * P(B | A) * P(C | A). To transform this into a Markov Random Field, factorize the similarly structured undirected network using potential functions on the A-B and A-C edges. Refer to the image below.
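A tiny numeric check of this equivalence, using invented probabilities for binary variables A, B, and C: folding P(A) into the A-B potential makes the MRF's product of potentials match the Bayesian network's joint distribution exactly.

```python
# Toy Bayesian network A -> B, A -> C with binary variables (numbers invented)
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # p_b_given_a[a][b]
p_c_given_a = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}  # p_c_given_a[a][c]

# MRF potentials on the two edges: fold P(A) into the A-B potential
phi_ab = {(a, b): p_a[a] * p_b_given_a[a][b] for a in (0, 1) for b in (0, 1)}
phi_ac = {(a, c): p_c_given_a[a][c] for a in (0, 1) for c in (0, 1)}

# The product of edge potentials equals the BN joint on every assignment
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            bn = p_a[a] * p_b_given_a[a][b] * p_c_given_a[a][c]
            mrf = phi_ab[(a, b)] * phi_ac[(a, c)]
            assert abs(bn - mrf) < 1e-12
print("BN and MRF factorizations agree on every assignment")
```

Because P(A) was absorbed into one potential, the partition function here is 1; in general an MRF's potentials only define the distribution up to a normalizing constant.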