Basics
Resume Preparation
Basic Interview QnA
Math
SQL
Excel
Python
Machine Learning Basics
Numpy & Pandas
Django
Projects
Data Scientist- Job Material
Basics:
Data science is the deep study of massive amounts of data. It involves extracting meaningful insights from raw, structured, and unstructured data using the scientific method, different technologies, and algorithms.
Data science is about data gathering, analysis, and decision-making.
Data science is about finding patterns in data through analysis and making future predictions.
In short, we can say that data science is all about:
• Asking the correct questions and analyzing the raw data.
• Modeling the data using various complex and efficient algorithms.
• Visualizing the data to get a better perspective.
• Understanding the data to make better decisions and finding the final result.
Example:
Suppose we want to travel from station A to station B by car. We need to make several decisions, such as which route will get us to the destination fastest, which route is likely to be free of traffic jams, and which is the most cost-effective. All of these decision factors act as input data, and the appropriate answer we derive from them is the result of data analysis, which is a part of data science.
With the help of data science technology, we can convert massive amounts of raw and unstructured data into meaningful insights.
Data science is being adopted by all kinds of companies, from big brands to startups. Google, Amazon, Netflix, and others that handle huge amounts of data use data science algorithms to improve the customer experience.
Data science is also working toward automating transportation, such as self-driving cars, which are widely seen as the future of transportation.
Data science can help with different kinds of predictions, such as surveys, elections, flight ticket confirmation, and more.
Types of Data Science Job:
If you learn data science, you get the opportunity to find various exciting job roles in this domain. The main job roles are given below:
• Data Scientist
• Data Analyst
• Machine learning expert
• Data engineer
• Data Architect
• Data Administrator
• Business Analyst
• Business Intelligence Manager
Below is an explanation of some key data science job titles.
1. Data Analyst:
A data analyst mines huge amounts of data, models the data, and looks for patterns, relationships, trends, and so on. At the end of the day, they produce visualizations and reports that support decision-making and problem-solving.
Skills required: To become a data analyst, you need a good background in mathematics, business intelligence, data mining, and basic statistics. You should also be familiar with computer languages and tools such as MATLAB, Python, SQL, Hive, Pig, Excel, SAS, R, JS, Spark, etc.
2. Machine Learning Expert:
The machine learning expert is the one who works with
various machine learning algorithms used in data
science such as regression, clustering, classification,
decision tree, random forest, etc.
Skills required: Computer programming languages such as Python, C++, R, Java, and Hadoop. You should also have an understanding of various algorithms, problem-solving and analytical skills, probability, and statistics.
3. Data Scientist:
A data scientist is a professional who works with an
enormous amount of data to come up with compelling
business insights through the deployment of various tools,
techniques, methodologies, algorithms, etc.
Skills required: To become a data scientist, one should have technical language skills such as R, SAS, SQL, Python, Hive, Pig, Apache Spark, and MATLAB. Data scientists must also have an understanding of statistics, mathematics, visualization, and communication skills.
Tools for Data Science
Following are some tools required for data science:
• Data Analysis tools: R, Python, Statistics, SAS, Jupyter, R Studio, MATLAB, Excel,
RapidMiner.
• Data Warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift
• Data Visualization tools: R, Jupyter, Tableau, Cognos.
• Machine learning tools: Spark, Mahout, Azure ML studio.
Prerequisite for Data Science
Non-Technical Prerequisite:
Curiosity: To learn data science, one must have curiosity. When you are curious and ask questions, you can understand the business problem more easily.
Critical thinking: A data scientist also needs critical thinking to find multiple new ways to solve a problem efficiently.
Communication skills: Communication skills are very important for a data scientist because, after solving a business problem, you need to communicate the results to the team.
Technical Prerequisite:
Machine learning: To understand data science, one needs to
understand the concept of machine learning. Data science uses
machine learning algorithms to solve various problems.
Mathematical modeling: Mathematical modeling is required to
make fast mathematical calculations and predictions from the
available data.
Statistics: Basic understanding of statistics is required, such as
mean, median, or standard deviation. It is needed to extract
knowledge and obtain better results from the data.
Computer programming: For data science, knowledge of at least one programming language is required. R, Python, and Spark are some of the programming languages commonly required for data science.
Databases: A deep understanding of databases, such as SQL, is essential for data science in order to retrieve and work with data.
How do you solve a problem in data science using machine learning algorithms?
Now, let's look at the most common types of problems that occur in data science and the approach to solving them. In data science, problems are solved using algorithms, and the categories below map common questions to the applicable algorithms:
Is this A or B?
This refers to problems that have only two fixed outcomes, such as Yes or No, 1 or 0, may or may not. This type of problem can be solved using classification algorithms.
Is this different?
This refers to questions where we look at many patterns and need to find the odd one out. Such problems can be solved using anomaly detection algorithms.
How much or how many?
Problems that ask for a numerical value or figure, such as what the temperature will be today, can be solved using regression algorithms.
How is this organized?
If a problem involves organizing data into groups, it can be solved using clustering algorithms.
Clustering algorithms organize and group data based on features, colors, or other common characteristics.
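As a rough illustration of this mapping (not part of the original material), the sketch below pairs each question type with a typical scikit-learn estimator; the toy data and the particular estimator choices are assumptions, not fixed rules.

# Rough mapping of the four question types to typical scikit-learn estimators.
# The data below is made up purely for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import IsolationForest
from sklearn.cluster import KMeans

X = np.random.rand(100, 3)                       # toy feature matrix
y = (X[:, 0] > 0.5).astype(int)                  # toy binary labels

classifier = LogisticRegression().fit(X, y)              # "Is this A or B?" -> classification
detector = IsolationForest(random_state=0).fit(X)        # "Is this different?" -> anomaly detection
regressor = LinearRegression().fit(X, X.sum(axis=1))     # "How much or how many?" -> regression
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)  # "How is this organized?" -> clustering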
Data Science life Cycle
• Here is how a Data Scientist works:
• Ask the right questions - To understand the business problem.
• Explore and collect data - From database, web logs, customer
feedback, etc.
• Extract the data - Transform the data to a standardized format.
• Clean the data - Remove erroneous values from the data.
• Find and replace missing values - Check for missing values and
replace them with a suitable value (e.g. an average value).
• Normalize data - Scale the values into a practical range (e.g. 140 cm is smaller than 1.8 m, but the number 140 is larger than 1.8, so scaling is important; see the sketch after this list).
• Analyze data, find patterns and make future predictions.
• Represent the result - Present the result with useful insights in a
way the "company" can understand.
The main phases of data science life cycle are given below:
1. Discovery: The first phase is discovery, which involves asking the right questions. When you start any data science project, you need to determine the basic requirements, priorities, and project budget. In this phase, we determine all the requirements of the project, such as the number of people, technology, time, data, and the end goal, and then we can frame the business problem at a first-hypothesis level.
2. Data preparation: Data preparation is also known as Data Munging. In this phase, we need to perform the following tasks:
• Data cleaning
• Data Reduction
• Data integration
• Data transformation
After performing all the above tasks, we can easily use this data for our further processes.
3. Model Planning: In this phase, we determine the various methods and techniques to establish the relationships between input variables. We apply exploratory data analysis (EDA), using various statistical formulas and visualization tools, to understand the relations between variables and to see what the data can tell us. Common tools used for model planning are:
• SQL Analysis Services
• R
• SAS
• Python
4. Model-building: In this phase, the process of model building starts. We create datasets for training and testing purposes, and apply different techniques such as association, classification, and clustering to build the model.
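A minimal sketch of this phase (an illustration, not part of the original deck): split the data into training and test sets, fit a simple model, and check its accuracy. The built-in iris dataset and the choice of classifier are assumptions.

# Split into train/test sets, fit a classifier, and evaluate it on held-out data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=200).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))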
Following are some common Model building tools:
o SAS Enterprise Miner
o WEKA
o SPSS Modeler
o MATLAB
5. Operationalize: In this phase, we deliver the final reports of the project, along with briefings, code, and technical documents. This phase gives you a clear overview of the complete project performance and other components on a small scale before full deployment.
6. Communicate results: In this phase, we check whether we reached the goal set in the initial phase, and we communicate the findings and final results to the business team.
Applications of Data Science:
Image recognition and speech recognition:
Data science is currently used for image and speech recognition. When you upload an image on Facebook, you start getting suggestions to tag your friends. This automatic tagging suggestion uses image recognition algorithms, which are part of data science.
When you say something to "Ok Google", Siri, Cortana, etc., and these devices respond to your voice, this is made possible by speech recognition algorithms.
Gaming world:
In the gaming world, the use of machine learning algorithms is increasing day by day. EA Sports, Sony, and Nintendo are widely using data science to enhance the user experience.
Internet search:
When we want to search for something on the internet, we use different search engines such as Google, Yahoo, Bing, Ask, etc. All these search engines use data science technology to make the search experience better, and you get search results within a fraction of a second.
Transport:
Transport industries are also using data science technology to create self-driving cars. With self-driving cars, it will be easier to reduce the number of road accidents.
Healthcare:
In the healthcare sector, data science provides lots of benefits. Data science is being used for tumor detection, drug discovery, medical image analysis, virtual medical bots, etc.
Recommendation systems:
Most companies, such as Amazon, Netflix, and Google Play, use data science technology to create a better user experience with personalized recommendations. For example, when you search for something on Amazon, you start getting suggestions for similar products, and this is because of data science technology.
Risk detection:
The finance industry has always had issues with fraud and the risk of losses, but with the help of data science, these can be reduced.
Most finance companies are looking for data scientists to avoid risk and losses while increasing customer satisfaction.
Resume Preparation:
• Make a great first impression. By customizing your resume for the role you are applying to, you can tell the hiring
manager why you’re perfect for it. This will make a good first impression, setting you up for success in your interview.
• Stand apart from the crowd. Recruiters skim through hundreds of resumes on any given day. With a clear and
impactful data scientist resume, you can differentiate yourself from the crowd. By highlighting your unique
combination of skills and experience, you can make an impression beyond the standard checklist of technical skills
alone.
• Drive the interview conversation. Hiring managers and interviewers often bring your resume to the interview,
asking questions based on it. By including the right information in your resume, you can drive the conversation.
• Negotiate competitive pay. While a resume might not have a direct impact on the pay, it plays the role of a single
source of truth for your qualifications. By including all relevant skills and experience, you can make sure that the
offer is reflective of your value to the employer.
What Should You Include in Your Data Science Resume?
Name and Contact Information
Once the recruiter has seen your resume and you’re shortlisted, they would want to contact you. To make
this seamless, include your contact information clearly and prominently. But remember that this is simply
functional information. So, keep it concise. Double-check that it’s accurate.
Include:
Name
Email ID
Phone number
LinkedIn, portfolio, or GitHub profiles, if any
What Should You Include in Your Data Science Resume?
Career Objective/Summary
This is often the first section in any resume. As a fresh graduate, without much professional experience, the
career objective section acts as an indicator of what you would like to accomplish at the job you’re applying
to. On the other hand, if you have some experience, it is better to include a personal profile, summarizing
your skills and experiences.
A few things to keep in mind while writing your career objective/summary:
• Use this section to narrate your professional story, so paragraphs with complete sentences work better
than a bulleted list
• Mention the years of experience you have
• Provide information on the industry, function, and roles you have worked in
What Should You Include in Your Data Science Resume?
While creating your resume, it is sometimes better to write this section last. Writing the rest of your data scientist resume first will help you hone in on the right summary. Also, remember to customize your summary when applying for a job. Not all jobs are the same, so your summary should reflect what you can do for the particular role you're applying to.
Work Experience
Since data science is a practical field, work experience matters more in data science jobs than theoretical knowledge. Therefore, this is the most crucial part of your resume.
If you are a fresh graduate, make sure to include any internships, personal projects, or open-source contributions you might have.
What Should You Include in Your Data Science Resume?
If you’re an experienced data scientist, spend enough time to tell your professional story clearly:
o List your work experience in reverse chronological order, with the most recent work listed on top and the
others following
o Indicate your designation, name of the company, and work period
o Write 1-2 lines about what you were responsible for
o Include the tasks you performed on a regular basis
o Demonstrate outcomes: if you have produced quantifiable results, be sure to include them. For instance: "I built a production prediction engine in Python that helped reduce crude oil profit loss by 22%"
o Add accomplishments like awards and recognitions, if any.
Layout-wise, follow consistency within this section. For instance, if you use bullets to list your tasks, use
them uniformly across all your job titles.
Projects
Showing your hiring manager a peek into the work you’ve done is a great way to demonstrate your capabilities. The
projects section can be used for that. While deciding which of your projects to include in your resume, consider the
following:
Relevance. You might have worked on several projects, but the most valuable are the ones that are relevant to the role
that you’re applying to. So, pick the most relevant 2-3 projects you’ve worked on.
Write a summary. Write 1-2 lines about the business context and your work. It helps to show that you know how to use
technical skills to achieve business outcomes.
Show technical expertise. Also include a short list of the tools, technologies, and processes you used to complete the
project.
It is also an option to write a detailed case study of your projects on a blog or Medium and link it here.
Skills
The first person to see your resume is often a recruiter who might not have the technical background to evaluate it in depth. So, they typically try to match every resume against the job description to identify whether the candidate has the necessary skills. Some organizations also use an applicant tracking system (ATS) to automate the screening. Therefore, it is important that your resume lists the skills the job description demands.
Keep it short
Include all the skills you have that the job description demands
Even if you have mentioned it in the experience or summary section, repeat it here
Education
Keep this section concise and clear.
List post-secondary degrees in your education section (i.e., community college, college, and graduate degrees)
Include the year of graduation
If you’re a fresh graduate, you can mention subjects you’ve studied that are relevant to the job you’re applying to
If you have a certification or have completed an online course in data science or related subjects, make sure to include
them as well
Senior Data Scientist Resume Examples
What To Include
As a senior data scientist with experience, you would be aiming for a position with more responsibility, like a data
science manager, for example. This demands a customized and confident resume.
o Customize the resume for the job you’re applying to—highlight relevant skills/experience, mirror the job description
o Focus on responsibilities and accomplishments instead of tasks
o Include business outcomes you’ve produced with your work
o Present case studies of your key projects
Why is this resume good?
The information is organized in a clear and concise manner giving the entire view of the candidate’s career without
overwhelming the reader
Each job has quantifiable outcomes, demonstrating the business acumen of the candidate
Also subtly hints at leadership skills by mentioning the responsibilities taken in coaching and leading teams
Basic Interview QnA:
1. What is Data Science?
Data science is a combination of algorithms, tools, and machine learning techniques that helps you find hidden patterns in the given raw data.
2. What is logistic regression in Data Science?
Logistic regression is also called the logit model. It is a method for forecasting a binary outcome from a linear combination of predictor variables.
3. Name three types of biases that can occur during sampling
In the sampling process, there are three types of biases, which are:
Selection bias
Under coverage bias
Survivorship bias
4. Discuss the Decision Tree algorithm
A decision tree is a popular supervised machine learning algorithm. It is mainly used for regression and classification. It breaks a dataset down into smaller and smaller subsets. A decision tree can handle both categorical and numerical data.
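As a quick, hedged illustration of the answer above, the sketch below fits a shallow decision tree classifier and a decision tree regressor on built-in toy datasets; the datasets and the depth limit are arbitrary choices.

# The same decision-tree family covers both classification and regression.
from sklearn.datasets import load_iris, load_diabetes
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

Xc, yc = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(Xc, yc)
print("classification accuracy:", clf.score(Xc, yc))

Xr, yr = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(Xr, yr)
print("regression R^2:", reg.score(Xr, yr))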
5. What are prior probability and likelihood?
Prior probability is the proportion of the dependent variable in the data set, while the likelihood is the probability of classifying a given observation in the presence of some other variable.
6. Explain Recommender Systems
A recommender system is a subclass of information filtering techniques. It helps you predict the preferences or ratings that users are likely to give to a product.
7. Name three disadvantages of using a linear model
Three disadvantages of the linear model are:
The assumption of linearity of the errors.
You can’t use this model for binary or count outcomes
There are plenty of overfitting problems that it can’t solve
8. Why do you need to perform resampling?
Resampling is done in below-given cases:
Estimating the accuracy of sample statistics by drawing randomly with replacement from a set of data points, or by using subsets of the accessible data
Substituting labels on data points when performing necessary tests
Validating models by using random subsets
9. List out the libraries in Python used for Data Analysis and Scientific Computations.
• SciPy
• Pandas
• Matplotlib
• NumPy
• SciKit
• Seaborn
10. What is Power Analysis?
Power analysis is an integral part of experimental design. It helps you determine the sample size required to detect an effect of a given size from a cause with a specific level of assurance. It also allows you to deploy a particular probability within a sample size constraint.
11. Explain Collaborative filtering
Collaborative filtering is used to find patterns by combining viewpoints, multiple data sources, and various agents.
12. What is bias?
Bias is an error introduced into your model because of the oversimplification of a machine learning algorithm. It can lead to under-fitting.
13. Discuss 'Naive' in the Naive Bayes algorithm
The Naive Bayes model is based on the Bayes theorem. It describes the probability of an event based on prior knowledge of conditions that might be related to that event. It is called 'naive' because it assumes that the features are independent of one another.
14. What is Linear Regression?
Linear regression is a statistical method in which the score of a variable 'A' is predicted from the score of a second variable 'B'. B is referred to as the predictor variable and A as the criterion variable.
15. State the difference between the expected value and mean value
There are not many differences, but the two terms are used in different contexts. Mean value is generally referred to when discussing a probability distribution, whereas expected value is referred to in the context of a random variable.
16. What is the aim of conducting A/B testing?
A/B testing is used to conduct randomized experiments with two variants, A and B. The goal of this testing method is to identify changes to a web page that maximize or increase the outcome of a strategy.
17. What is Ensemble Learning?
Ensembling is a method of combining a diverse set of learners to improve the stability and predictive power of the model. Two types of ensemble learning methods are:
Bagging
The bagging method trains similar learners on small sample populations and combines their predictions, which helps you make closer predictions.
Boosting
Boosting is an iterative method that adjusts the weight of an observation depending on the last classification. Boosting decreases the bias error and helps you build strong predictive models.
18. Explain Eigenvalue and Eigenvector
Eigenvectors are used to understand linear transformations. In data science, we typically calculate the eigenvectors of a covariance or correlation matrix. Eigenvectors are the directions along which a particular linear transformation acts by compressing, flipping, or stretching, and eigenvalues are the factors by which the transformation scales along those directions.
19. Define the term cross-validation
Cross-validation is a validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent dataset. This method is used in settings where the objective is forecasting, and one needs to estimate how accurately a model will perform in practice.
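A short, hedged sketch of k-fold cross-validation with scikit-learn follows; the dataset, the scaling step, and the choice of classifier are illustrative assumptions.

# 5-fold cross-validation: fit on 4 folds, score on the held-out fold, repeat.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X, y, cv=5)
print("fold accuracies:", scores, "mean:", scores.mean())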
20. Explain the steps for a Data analytics project
The following are important steps involved in an analytics project:
Understand the Business problem
Explore the data and study it carefully.
Prepare the data for modeling by finding missing values and transforming variables.
Run the model and analyze the results.
Validate the model with new data set.
Implement the model and track the result to analyze the performance of the model for a specific period.
21. Discuss Artificial Neural Networks
Artificial Neural networks (ANN) are a special set of algorithms that have revolutionized machine learning. It helps
you to adapt according to changing input. So the network generates the best possible result without redesigning the
output criteria.
22. What is Back Propagation?
Back-propagation is the essence of neural net training. It is the method of tuning the weights of a neural net depending on the error rate obtained in the previous epoch. Proper tuning of the weights helps you reduce error rates and make the model reliable by increasing its generalization.
23. What is a Random Forest?
Random forest is a machine learning method which helps you to perform all types of regression and classification tasks. It
is also used for treating missing values and outlier values.
24. What is the importance of having a selection bias?
Selection Bias occurs when there is no specific randomization achieved while picking individuals or groups or data to be
analyzed. It suggests that the given sample does not exactly represent the population which was intended to be analyzed.
25. What is the K-means clustering method?
K-means clustering is an important unsupervised learning method. It is a technique for grouping data into a certain number of clusters, called K clusters. It is deployed to group data and find similarity within the data.
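Here is an illustrative K-means sketch (added for demonstration, not part of the original answer) that groups synthetic 2-D points into K = 3 clusters.

# Group synthetic points into K = 3 clusters and print the cluster centres.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in (0, 5, 10)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)
print(km.labels_[:10])   # cluster assignment of the first ten points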
26. Explain the difference between Data Science and Data Analytics
Data scientists need to slice data to extract valuable insights that a data analyst can apply to real-world business scenarios. The main difference between the two is that data scientists have more technical knowledge than business analysts, and they do not necessarily need the business-level understanding required for data visualization.
27. Explain p-value
When you conduct a hypothesis test in statistics, a p-value allows you to determine the strength of your results. It is a number between 0 and 1, and its value helps you judge the strength of the specific result.
28. Define the term deep learning
Deep learning is a subtype of machine learning. It is concerned with algorithms inspired by the structure and function of the brain, called artificial neural networks (ANN).
29. Explain the method to collect and analyze data to use social media to predict the weather condition.
You can collect social media data using the Facebook, Twitter, and Instagram APIs. For example, for Twitter, we can construct features from each tweet, such as the tweeted date, retweets, list of followers, etc.
Then you can use a multivariate time series model to predict the weather condition.
30. When do you need to update the algorithm in Data Science?
You need to update an algorithm in the following situations:
You want your data model to evolve as data streams through the infrastructure
The underlying data source is changing, or the data is non-stationary
31. What is Normal Distribution?
A normal distribution is a set of continuous values spread across a normal curve, in the shape of a bell curve. You can consider it a continuous probability distribution that is useful in statistics. The normal distribution curve is useful for analyzing variables and their relationships.
32. Which language is best for text analytics, R or Python?
Python is more suitable for text analytics, as it includes a rich library called pandas. It allows you to use high-level data analysis tools and data structures, whereas R doesn't offer this feature.
33. Explain the benefits of statistics for Data Scientists
Statistics help data scientists get a better idea of customers' expectations. Using statistical methods, data scientists can gain knowledge about consumer interest, behavior, engagement, retention, etc. Statistics also help you build powerful data models and validate certain inferences and predictions.
34. Name various types of Deep Learning Frameworks
o Pytorch
o Microsoft Cognitive Toolkit
o TensorFlow
o Caffe
o Chainer
o Keras
35. Explain Auto-Encoders
Autoencoders are learning networks that transform inputs into outputs with as few errors as possible. This means the output will be as close to the input as possible.
36. Define Boltzmann Machine
A Boltzmann machine is a simple learning algorithm. It helps you discover the features that represent complex regularities in the training data. This algorithm allows you to optimize the weights and the quantities for the given problem.
37. Explain why data cleansing is essential and which method you use to maintain clean data
Dirty data often leads to incorrect insights, which can damage the prospects of any organization. For example, if you want to run a targeted marketing campaign but your data incorrectly tells you that a specific product will be in demand with your target audience, the campaign will fail.
38. What are skewed and uniform distributions?
A skewed distribution occurs when the data is concentrated on one side of the plot, whereas a uniform distribution is one in which the data is spread equally across the range.
39. When does under-fitting occur in a statistical model?
Under-fitting occurs when a statistical model or machine learning algorithm is not able to capture the underlying trend of the data.
40. What is reinforcement learning?
Reinforcement learning is a learning mechanism for mapping situations to actions. The end result should help you maximize a numerical reward signal. In this method, a learner is not told which action to take but instead must discover which actions offer the maximum reward, as the method is based on a reward/penalty mechanism.
41. Name commonly used algorithms.
Four most commonly used algorithm by Data scientist are:
Linear regression
Logistic regression
Random Forest
KNN
42. What is precision?
Precision is one of the most commonly used error metrics in classification. It is the ratio of true positives to all predicted positives, and its range is from 0 to 1, where 1 represents 100%.
43. What is univariate analysis?
An analysis that is applied to one attribute at a time is known as univariate analysis. The boxplot is a widely used univariate visualization.
44. How do you overcome challenges to your findings?
To overcome challenges to your findings, you need to encourage discussion, demonstrate leadership, and respect different opinions.
45. Explain the cluster sampling technique in Data Science
A cluster sampling method is used when it is challenging to study a target population that is spread across a wide area and simple random sampling can't be applied.
46. State the difference between a Validation Set and a Test Set
A validation set is usually considered part of the training set, as it is used for parameter selection, which helps you avoid overfitting the model being built.
A test set is used for testing or evaluating the performance of a trained machine learning model.
47. Explain the term Binomial Probability Formula
"The binomial distribution contains the probabilities of every possible number of successes in N trials for independent events that each have a probability of π of occurring."
48. What is recall?
Recall is the ratio of true positives to all actual positives (true positives plus false negatives). It ranges from 0 to 1.
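For illustration (not part of the original answer), the short sketch below computes precision and recall on made-up labels with scikit-learn.

# precision = TP / (TP + FP), recall = TP / (TP + FN)
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # made-up ground truth
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # made-up predictions
print("precision:", precision_score(y_true, y_pred))  # 0.75
print("recall:", recall_score(y_true, y_pred))        # 0.75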
49. Discuss normal distribution
A normal distribution is symmetrically distributed, so the mean, median, and mode are equal.
50. While working on a data set, how can you select important variables? Explain
You can use the following variable selection methods (see the sketch after this list):
Remove correlated variables before selecting important variables
Use linear regression and select variables based on their p-values
Use backward, forward, and stepwise selection
Use XGBoost or Random Forest and plot a variable importance chart
Measure information gain for the given set of features and select the top n features accordingly
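As a hedged sketch of the random-forest approach from the list above, the code below ranks features by importance on a built-in dataset; the dataset and hyperparameters are arbitrary.

# Rank features by random-forest importance.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)
importances = pd.Series(rf.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False).head(10))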
51. Is it possible to capture the correlation between continuous and categorical variable?
Yes, we can use analysis of covariance technique to capture the association between continuous and categorical variables.
52. Would treating a categorical variable as a continuous variable result in a better predictive model?
A categorical variable should be treated as continuous only when it is ordinal in nature; in that case it can lead to a better predictive model.
Math Interview QnA:
Q: When should you use a t-test vs a z-test?
A Z-test is a hypothesis test with a normal distribution that uses a z-statistic. A z-test is used when you know the population
variance or if you don’t know the population variance but have a large sample size.
A T-test is a hypothesis test with a t-distribution that uses a t-statistic. You would use a t-test when you don’t know the
population variance and have a small sample size.
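As a small, hedged illustration of the t-test described above (not part of the original text), the sketch below runs a one-sample t-test in SciPy on made-up measurements; a z-test would instead rely on the known population variance.

# One-sample t-test: is the sample mean significantly different from 5.0?
import numpy as np
from scipy import stats

sample = np.array([5.1, 4.9, 5.3, 5.0, 4.8, 5.2, 5.4, 4.7])   # made-up small sample
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print("t statistic:", t_stat, "p-value:", p_value)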
Q: How would you describe what a 'p-value' is to a non-technical person?
The best way to describe the p-value in simple terms is with an example. In practice, if the p-value is less than the alpha, say 0.05, then we're saying that there's a probability of less than 5% that the result could have happened by chance. Similarly, a p-value of 0.05 is the same as saying "5% of the time, we would see this by chance."
Q: What is cherry-picking, P-hacking, and significance chasing?
Cherry picking refers to the practice of only selecting data or information that supports one’s desired conclusion.
P-hacking refers to manipulating data collection or analysis until non-significant results become significant. This includes deciding mid-test not to collect any more data.
Significance chasing refers to a researcher reporting insignificant results as if they were "almost" significant.
Q: What is the assumption of normality?
The assumption of normality is that the sampling distribution is normal and centered around the population parameter, in line with the central limit theorem.
Q: What is the central limit theorem and why is it so important?
The central limit theorem is very powerful — it states that the distribution of sample means approximates a normal
distribution.
To give an example, you would take a sample from a data set and calculate the mean of that sample. Once repeated multiple
times, you would plot all your means and their frequencies onto a graph and see that a bell curve, also known as a normal
distribution, has been created. The mean of this distribution will closely resemble that of the original data.
The central limit theorem is important because it is used in hypothesis testing and also to calculate confidence intervals.
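A quick simulation (an illustration, not from the original text) shows the central limit theorem in action: means of repeated samples from a skewed exponential distribution cluster into a roughly normal shape.

# Means of 10,000 samples (each of size 50) drawn from a skewed distribution.
import numpy as np

rng = np.random.default_rng(42)
sample_means = [rng.exponential(scale=2.0, size=50).mean() for _ in range(10_000)]
# The mean of the sample means is close to 2.0 and their spread is close to 2.0/sqrt(50).
print(np.mean(sample_means), np.std(sample_means))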
Q: What is the empirical rule?
The empirical rule states that if a dataset is normally distributed, 68% of the data will fall within one standard deviation, 95%
of the data will fall within two standard deviations, and 99.7% of the data will fall within 3 standard deviations.
Q: What general conditions must be satisfied for the central limit theorem to hold?
The data must be sampled randomly
The sample values must be independent of each other
The sample size must be sufficiently large; generally it should be greater than or equal to 30
Q: What is the difference between a combination and a permutation?
A permutation of n elements is any arrangement of those n elements in a definite order. There are n factorial (n!) ways to arrange n elements. Note: order matters! The number of permutations of n things taken r at a time is the number of ordered r-tuples that can be taken from n different elements, given by P(n, r) = n! / (n - r)!
On the other hand, combinations refer to the number of ways to choose r out of n objects where order doesn't matter. The number of combinations of n things taken r at a time is the number of subsets with r elements of a set with n elements, given by C(n, r) = n! / (r! (n - r)!)
Q: How many permutations does a license plate have with 6 digits?
Assuming each of the 6 positions can hold any of the 10 digits (repetition allowed), there are 10^6 = 1,000,000 possibilities.
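The small check below (added for illustration) evaluates these counting formulas with Python's math module.

# math.perm(n, r) = n!/(n-r)!   and   math.comb(n, r) = n!/(r!(n-r)!)
import math

print(math.perm(5, 2))   # 20 ordered arrangements of 2 items chosen from 5
print(math.comb(5, 2))   # 10 unordered selections of 2 items chosen from 5
print(10 ** 6)           # 6 digit positions with repetition allowed: 1,000,000 plates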
Q: How are confidence intervals and hypothesis tests similar? How are they different?
Confidence intervals and hypothesis testing are both tools used to make statistical inferences.
A confidence interval suggests a range of values for an unknown parameter and is associated with a confidence level that the true parameter lies within the suggested range. Confidence intervals are often very important in medical research to give researchers a stronger basis for their estimations. A confidence interval can be written as "10 +/- 0.5" or [9.5, 10.5], for example.
Hypothesis testing is the basis of any research question and often comes down to trying to show that something did not happen by chance. For example, you could try to show that when rolling a die, one number was more likely to come up than the rest.
Q: What is the difference between observational and experimental data?
Observational data comes from observational studies which are when you observe certain variables and try to determine if
there is any correlation.
Experimental data comes from experimental studies which are when you control certain variables and hold them constant to
determine if there is any causality.
An example of experimental design is the following: split a group up into two. The control group lives their lives normally. The
test group is told to drink a glass of wine every night for 30 days. Then research can be conducted to see how wine affects
sleep.
Q: Give some examples of some random sampling techniques
Simple random sampling requires using randomly generated numbers to choose a sample. More specifically, it initially requires
a sampling frame, a list or database of all members of a population. You can then randomly generate a number for each
element, using Excel for example, and take the first n samples that you require.
Systematic sampling can be even easier to do: you simply take one element from your sampling frame, skip a predefined amount (n), and then take your next element. Going back to our example, you could take every fourth name on the list.
Cluster sampling starts by dividing a population into groups, or clusters. What makes this different from stratified sampling is that each cluster must be representative of the population. Then, you randomly select entire clusters to sample. For example, if an elementary school had five different grade eight classes, cluster random sampling might be used and only one class would be chosen as the sample.
Stratified random sampling starts off by dividing a population into groups with similar attributes. Then a random sample is taken from each group. This method is used to ensure that different segments in a population are equally represented. To give an example, imagine a survey is conducted at a school to determine overall satisfaction. It might make sense here to use stratified random sampling to equally represent the opinions of students in each department.
Q: What is the difference between type 1 error and type 2 error?
A type 1 error is when you incorrectly reject a true null hypothesis. It’s also called a false positive.
A type 2 error is when you don’t reject a false null hypothesis. It’s also called a false negative.
Q: What is the power of a test? What are two ways to increase the power of a test?
The power of a test is the probability of rejecting the null hypothesis when it’s false. It’s also equal to 1 minus the beta.
To increase the power of the test, you can do two things:
You can increase alpha, but it also increases the chance of a type 1 error
Increase the sample size, n. This maintains the type 1 error but reduces type 2.
Q: What is the Law of Large Numbers?
The Law of Large Numbers is a theorem stating that as the number of trials increases, the average of the results gets closer to the expected value.
E.g. the proportion of heads when flipping a fair coin 100,000 times should be closer to 0.5 than when flipping it 100 times.
Q: What is the Pareto principle?
The Pareto principle, also known as the 80/20 rule states that 80% of the effects come from 20% of the causes. Eg. 80% of
sales come from 20% of customers.
Q: What is a confounding variable?
A confounding variable, or a confounder, is a variable that influences both the dependent variable and the independent
variable, causing a spurious association, a mathematical relationship in which two or more variables are associated but not
causally related.
Q: What are the assumptions required for linear regression?
There are four major assumptions:
There is a linear relationship between the dependent variables and the regressors, meaning the model you are creating
actually fits the data
The errors or residuals of the data are normally distributed and independent from each other
There is minimal multicollinearity between explanatory variables
Homoscedasticity. This means the variance around the regression line is the same for all values of the predictor variable.
Q: What does interpolation and extrapolation mean? Which is generally more accurate?
Interpolation is a prediction made using inputs that lie within the set of observed values. Extrapolation is when a prediction
is made using an input that’s outside the set of observed values.
Generally, interpolations are more accurate.
Q: What does autocorrelation mean?
Autocorrelation is when future outcomes depend on previous outcomes. When there is autocorrelation, the errors show a
sequential pattern and the model is less accurate.
Q: When you sample, what potential biases can you be inflicting?
Potential biases include the following:
Sampling bias: a biased sample caused by non-random sampling
Under coverage bias: sampling too few observations
Survivorship bias: error of overlooking observations that did not make it past a form of selection process.
Q: What is an outlier? Explain how you might screen for outliers and what would you do if you found them in your dataset.
An outlier is a data point that differs significantly from other observations.
Depending on the cause of the outlier, they can be bad from a machine learning perspective because they can worsen the
accuracy of a model. If the outlier is caused by a measurement error, it’s important to remove them from the dataset.
Q: What is an inlier?
An inlier is a data observation that lies within the rest of the dataset and is unusual or an error. Since it lies in the dataset, it is
typically harder to identify than an outlier and requires external data to identify them. Should you identify any inliers, you can
simply remove them from the dataset to address them.
Q: You roll a biased coin (p(head)=0.8) five times. What’s the probability of getting three or more heads?
Use the General Binomial Probability formula to answer this question:
p = 0.8
n = 5
k = 3,4,5
P(3 or more heads) = P(3 heads) + P(4 heads) + P(5 heads) = 0.94 or 94%
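The following hedged Python check (not part of the original answer) reproduces this binomial calculation with scipy.stats.

# P(X >= 3) for X ~ Binomial(n=5, p=0.8)
from scipy.stats import binom

p, n = 0.8, 5
print(binom.sf(2, n, p))                            # survival function: P(X > 2) ~= 0.94208
print(sum(binom.pmf(k, n, p) for k in (3, 4, 5)))   # same value, summed term by term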
Q: A random variable X is normal with mean 1020 and a standard deviation 50. Calculate P(X>1200)
Using Excel…
p =1-norm.dist(1200, 1020, 50, true)
p= 0.000159
Q: Consider the number of people that show up at a bus station is Poisson with mean 2.5/h. What is the probability that at most
three people show up in a four hour period?
x = 3
mean = 2.5*4 = 10
using Excel…
p = poisson.dist(3,10,true)
p = 0.010336
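For reference, here is a hedged SciPy equivalent of the two Excel formulas above (an addition for illustration, not from the original text).

# Normal tail probability and Poisson cumulative probability.
from scipy.stats import norm, poisson

print(1 - norm.cdf(1200, loc=1020, scale=50))   # P(X > 1200) ~= 0.000159
print(poisson.cdf(3, mu=10))                    # P(at most 3 arrivals in 4 hours) ~= 0.010336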
Q: Suppose that diastolic blood pressures (DBPs) for men aged 35–44 are normally distributed with a mean of 80 (mm Hg) and a standard deviation of 10. About what is the probability that a random 35–44 year old has a DBP less than 70?
Since 70 is one standard deviation below the mean, take the area of the Gaussian distribution to the left of one standard deviation below the mean: 2.3% + 13.6% = 15.9%.
Q: Give an example where the median is a better measure than the mean
When there are a number of outliers that positively or negatively skew the data.
Q: Given two fair dice, what is the probability of getting scores that sum to 4? To 8?
There are 3 combinations that sum to 4 (1+3, 3+1, 2+2):
P(rolling a 4) = 3/36 = 1/12
There are 5 combinations that sum to 8 (2+6, 6+2, 3+5, 5+3, 4+4):
P(rolling an 8) = 5/36
Q: If a distribution is skewed to the right and has a median of 30, will the mean be greater than or less than 30?
If the given distribution is right-skewed, then the mean will typically be greater than 30, while the mode will be less than 30.
SQL Interview QnA:
MySQL Create Table Example
Below is a MySQL example to create a table in database:
CREATE TABLE IF NOT EXISTS `MyFlixDB`.`Members` (
`membership_number` INT AUTO_INCREMENT ,
`full_names` VARCHAR(150) NOT NULL ,
`gender` VARCHAR(6) ,
`date_of_birth` DATE ,
`physical_address` VARCHAR(255) ,
`postal_address` VARCHAR(255) ,
`contact_number` VARCHAR(75) ,
`email` VARCHAR(255) ,
PRIMARY KEY (`membership_number`) )
ENGINE = InnoDB;
Let’s see a query for creating a table which has data
of all data types. Study it and identify how each data
type is defined in the below create table MySQL
example.
CREATE TABLE `all_data_types` (
`varchar` VARCHAR( 20 ) ,
`tinyint` TINYINT ,
`text` TEXT ,
`date` DATE ,
`smallint` SMALLINT ,
`mediumint` MEDIUMINT ,
`int` INT ,
`bigint` BIGINT ,
`float` FLOAT( 10, 2 ) ,
`double` DOUBLE ,
`decimal` DECIMAL( 10, 2 ) ,
`datetime` DATETIME ,
`timestamp` TIMESTAMP ,
`time` TIME ,
`year` YEAR ,
`char` CHAR( 10 ) ,
`tinyblob` TINYBLOB ,
`tinytext` TINYTEXT ,
`blob` BLOB ,
`mediumblob` MEDIUMBLOB ,
`mediumtext` MEDIUMTEXT ,
`longblob` LONGBLOB ,
`longtext` LONGTEXT ,
`enum` ENUM( '1', '2', '3' ) ,
`set` SET( '1', '2', '3' ) ,
`bool` BOOL ,
`binary` BINARY( 20 ) ,
`varbinary` VARBINARY( 20 )
) ENGINE= MYISAM ;
1. What is DBMS?
A Database Management System (DBMS) is a program that controls creation, maintenance and use of a database. DBMS can be
termed as File Manager that manages data in a database rather than saving it in file systems.
2. What is RDBMS?
RDBMS stands for Relational Database Management System. RDBMS store the data into the collection of tables, which is
related by common fields between the columns of the table. It also provides relational operators to manipulate the data stored
into the tables.
Example: SQL Server.
3. What is SQL?
SQL stands for Structured Query Language, and it is used to communicate with the database. It is a standard language used to perform tasks such as retrieval, updating, insertion and deletion of data from a database.
Standard SQL commands include SELECT, INSERT, UPDATE, DELETE, CREATE, and DROP.
4. What is a Database?
Database is nothing but an organized form of data for easy access, storing, retrieval and managing of data. This is also known as
structured form of data which can be accessed in many ways.
Example: School Management Database, Bank Management Database.
5. What are tables and Fields?
A table is a set of data organized in a model with columns and rows. Columns are vertical, and rows are horizontal. A table has a specified number of columns, called fields, but can have any number of rows, which are called records.
Example:.
Table: Employee.
Field: Emp ID, Emp Name, Date of Birth.
Data: 201456, David, 11/15/1960.
6. What is a primary key?
A primary key is a combination of fields which uniquely specify a row. This is a special kind of unique key, and it has implicit NOT
NULL constraint. It means, Primary key values cannot be NULL.
7. What is a unique key?
A unique key constraint uniquely identifies each record in the database. It provides uniqueness for a column or set of columns.
A primary key constraint has an automatic unique constraint defined on it, but that is not the case for a unique key.
There can be many unique constraints defined per table, but only one primary key constraint per table.
8. What is a foreign key?
A foreign key is a field in one table that relates to the primary key of another table. The relationship between two tables is created by referencing the foreign key of one table to the primary key of the other.
9. What is a join?
JOIN is a keyword used to query data from two or more tables based on the relationship between fields of the tables. Keys play a major role when JOINs are used.
10. What are the types of join and explain each?
There are various types of join which can be used to retrieve data and it depends on the relationship between tables.
Inner Join.
An inner join returns rows when there is at least one match between the tables.
Right Join.
A right join returns the rows that are common between the tables plus all rows of the right-hand side table. Simply put, it returns all the rows from the right-hand side table even if there are no matches in the left-hand side table.
Left Join.
A left join returns the rows that are common between the tables plus all rows of the left-hand side table. Simply put, it returns all the rows from the left-hand side table even if there are no matches in the right-hand side table.
Full Join.
A full join returns rows when there are matching rows in either of the tables. This means it returns all the rows from the left-hand side table and all the rows from the right-hand side table.
11. What is normalization?
Normalization is the process of minimizing redundancy and dependency by organizing the fields and tables of a database. Its main aim is to ensure that additions, deletions, and modifications of a field can be made in a single table.
12. What is Denormalization?
Denormalization is a technique used to access data from higher to lower normal forms of a database. It is also the process of introducing redundancy into a table by incorporating data from related tables.
14. What is a View?
A view is a virtual table consisting of a subset of the data contained in a table. Views are not physically present, so they take less space to store. A view can combine data from one or more tables, depending on the relationship.
15. What is an Index?
An index is a performance-tuning method that allows faster retrieval of records from a table. An index creates an entry for each value, which makes data retrieval faster.
What is a trigger?
A DB trigger is code or a program that automatically executes in response to some event on a table or view in a database. Mainly, triggers help maintain the integrity of the database.
Example: When a new student is added to the student database, new records should be created in related tables such as the Exam, Score, and Attendance tables.
What is the difference between DELETE and TRUNCATE commands?
DELETE command is used to remove rows from the table, and WHERE clause can be used for conditional set of parameters.
Commit and Rollback can be performed after delete statement.
TRUNCATE removes all rows from the table. Truncate operation cannot be rolled back.
What are local and global variables and their differences?
Local variables are variables that can be used or exist only inside a function. They are not known to other functions, which cannot refer to or use them. They are created whenever the function is called.
Global variables are variables that can be used or exist throughout the program. Unlike local variables, they are not created each time a function is called.
What is the difference between TRUNCATE and DROP statements?
TRUNCATE removes all the rows from the table, and it cannot be rolled back. DROP command removes a table from the
database and operation cannot be rolled back.
What are aggregate and scalar functions?
Aggregate functions evaluate a mathematical calculation over a set of values and return a single value, calculated from the columns in a table. Scalar functions return a single value based on the input value.
Example:
Aggregate – MAX(), COUNT() – calculated with respect to numeric values.
Scalar – UCASE(), NOW() – calculated with respect to strings.
How can you create an empty table from an existing table?
Example will be -.
Select * into studentcopy from student where 1=2
Here, we are copying student table to another table with the same structure with no rows copied.
How to fetch common records from two tables?
Common records result set can be achieved by -.
Select studentID from student INTERSECT Select StudentID from Exam
How to fetch alternate records from a table?
Records can be fetched for both odd and even row numbers.
To display even rows:
Select studentId from (Select rowno, studentId from student) where mod(rowno,2)=0
To display odd rows:
Select studentId from (Select rowno, studentId from student) where mod(rowno,2)=1
How to select unique records from a table?
Select unique records from a table by using DISTINCT keyword.
Select DISTINCT StudentID, StudentName from Student.
What is the command used to fetch the first 5 characters of a string?
There are many ways to fetch the first 5 characters of a string, for example:
Select SUBSTRING(StudentName,1,5) as studentname from student
Select LEFT(Studentname,5) as studentname from student
Which operator is used in query for pattern matching?
The LIKE operator is used for pattern matching, and it can be used with:
1. % – Matches zero or more characters.
2. _ (Underscore) – Matches exactly one character.
Example -
Select * from Student where studentname like 'a%'
Select * from Student where studentname like 'ami_'
Summary
• Creating a database involves translating the logical database design model into the physical database.
• MySQL supports a number of data types for numeric, date, and string values.
• CREATE DATABASE command is used to create a database
• CREATE TABLE command is used to create tables in a database
• MySQL workbench supports forward engineering which involves automatically generating SQL scripts from the logical
database model that can be executed to create the physical database
MS Excel Interview QnA:
Data Scientist- Job Material
What are the common data formats in Microsoft Excel?
"The most common data formats used in Microsoft Excel are numbers, percentages, dates and sometimes texts (as in
words and strings of text)."
How are these data formats used in Microsoft Excel?
"Numbers can be formatted in data cells as decimals or round values. Percentages show a part of a whole, with the whole
being 100%. The dates can automatically change depending on the region and location Microsoft Excel is connected from.
And the text format is used when analyses, reports or other documents are entered into the Excel spreadsheet as data."
What are the cell references?
"Cell references are used to refer to data located in a different cell of the same Excel spreadsheet. There are
three different cell reference types: absolute, relative and mixed cell references."
What are the functions of different cell references?
The absolute cell reference forces the data to stay in the cell which it was put in. No matter how many formulas are used
on the data itself, an absolute cell reference stays with the data.
The relative cell reference moves with the cell when the formula on the cell is moved to another one.
And the mixed cell reference indicates that the row or the column related to the data cell is changed or moved.
Which key or combination of keys allow you to toggle between the absolute, relative and mixed cell references?
A sample answer on the key to press to shift between the cell references can be: "On Windows devices, the F4 key lets you
change the cell references. On Mac devices, the key combination Command + T allows you this shift."
What is the function of the dollar sign ($) in Microsoft Excel?
"The dollar sign, when written, tells Excel whether or not to change the location of the reference if the formula is
copied to other cells."
What is the LOOKUP function in Microsoft Excel?
"The LOOKUP function allows the user to find exact or partial matches in the spreadsheet. The VLOOKUP option lets the
user search for data located in the vertical direction. The HLOOKUP option functions the same way but in the horizontal
plane."
What is conditional formatting?
"Conditional formatting allows you to change the visual aspect of cells. For example, you want all the cells which include a
value of 3 to be highlighted with a yellow highlighter and made italic. Conditional formatting lets you achieve this action
in only seconds."
What are the most important functions of Excel to you as a business analyst?
"I most often use the LOOKUP function, followed by the COUNT and COUNTA functions. The IF, MAX and MIN functions
are also among the ones I use regularly."
What does the COUNTA function execute in an Excel spreadsheet?
"The COUNTA function can scan all the rows and columns that contain data, identify them and ignore the empty cells."
Can you import data from other software into an Excel spreadsheet?
"Importing data from various external data sources into an Excel spreadsheet is possible. Just go to the 'Data' tab in
the toolbar above. By clicking the 'Get External Data' button, you will be able to import data from other software into
Excel."
Why do you think the knowledge of Microsoft Excel is important for a business analyst?
"Using Excel and dealing with the company's data is crucial because it is often the main data the organization has. And it is in the
hands of the business analyst to analyze it and come up with results and solutions for problems. The business analyst is also
a financial consultant as well as an analyst. You can become the person that the CEO listens to in order to 'make' or
'break' certain deals."
As a business analyst, do you choose to store your sensitive data in a Microsoft Excel spreadsheet?
"Yes, I do store my client's data in Microsoft Excel. However, if the data I am dealing with is confidential, then I would not
be storing that sensitive data in an Excel file."
As a business analyst, how would you operate with sensitive data in Microsoft Excel?
"Because I would be responsible for the transfer, the possible disappearance or the leak of the data, I would
store confidential data in software other than Microsoft Excel."
How can you protect your data in Microsoft Excel?
"From the Review tab, you can choose to protect your sheet with a password. That way the spreadsheet will be password
protected and cannot be opened or copied without the password."
Python Interview QnA:
Data Scientist- Job Material
Q1. What built-in data types are used in Python?
Python uses several built-in data types, including:
Number (int, float and complex)
String (str)
Tuple (tuple)
Range (range)
List (list)
Set (set)
Dictionary (dict)
In Python, data types are used to classify or categorize data, and every value has a data type
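As a quick illustration (a small sketch, not from the original slides), type() reports the built-in type of a value:
print(type(42))          # <class 'int'>
print(type(3.14))        # <class 'float'>
print(type("text"))      # <class 'str'>
print(type((1, 2)))      # <class 'tuple'>
print(type([1, 2]))      # <class 'list'>
print(type({1, 2}))      # <class 'set'>
print(type({"a": 1}))    # <class 'dict'>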
Q2. How are data analysis libraries used in Python? What are some of the most common libraries?
A key reason Python is such a popular data science programming language is because there is an extensive collection of
data analysis libraries available. These libraries include functions, tools and methods for managing and analyzing data.
There are Python libraries for performing a wide range of data science functions, including processing image and textual
data, data mining and data visualization. The most widely used Python data analysis libraries include:
o Pandas
o NumPy
o SciPy
o TensorFlow
o SciKit
o Seaborn
o Matplotlib
Q3. How is a negative index used in Python?
Negative indexes are used in Python to index lists, strings, and other sequences from the end, counting backwards.
For example, index -1 refers to the last item in a list, while -2 refers to the second to last. Here's an example
of a negative index in Python:
b = "Python Coding Fun"
print(b[-1])
>> n
Q4. What’s the difference between lists and tuples in Python?
Lists and tuples are classes in Python that store one or more objects or values. Key differences include:
Syntax – Lists are enclosed in square brackets and tuples are enclosed in parentheses.
Mutable vs. Immutable – Lists are mutable, which means they can be modified after being created. Tuples
are immutable, which means they cannot be modified.
Operations – Lists have more functionalities available than tuples, including insert and pop operations and
sorting.
Size – Because tuples are immutable, they require less memory and are subsequently faster.
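A small sketch illustrating the mutability difference:
nums_list = [1, 2, 3]
nums_list[0] = 10          # fine: lists are mutable
nums_tuple = (1, 2, 3)
# nums_tuple[0] = 10       # would raise a TypeError: tuples are immutable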
Python Statistics Questions
Python statistics questions are based on implementing statistical analyses and testing how well you know
statistical concepts and can translate them into code. Many times, these questions take the form of random
sampling from a distribution, generating histograms, and computing different statistical metrics such as
standard deviation, mean, or median.
Q5. Write a function to generate N samples from a normal distribution and plot them on the histogram.
This is a relatively simple problem. We have to set up our distribution and generate N samples from it, which
are then plotted. In this question, we make use of the SciPy library which is a library made for scientific
computing.
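A possible sketch (assuming SciPy and Matplotlib are available; the exact solution shown on the original slide is not reproduced here):
import matplotlib.pyplot as plt
from scipy import stats

def plot_normal_samples(n, mean=0, std=1):
    # Draw n samples from a normal distribution and plot a histogram of them
    samples = stats.norm.rvs(loc=mean, scale=std, size=n)
    plt.hist(samples, bins=30)
    plt.show()
    return samples

plot_normal_samples(1000)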
Q6. Write a function that takes in a list of dictionaries with a key and list of integers and returns a
dictionary with the standard deviation of each list.
Note that this should be done without using the NumPy built-in functions.
Example: see the sketch after the hint below.
Hint: Remember the equation for standard deviation. To fulfil this function, we need the following equation: take the
sum of the squares of (each data value minus the mean), divide by the total number of data points, and take the square
root of the result. Does the expression inside the square root look familiar?
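One way to sketch it (the assumed input structure, a list of dicts like {"key": ..., "values": [...]}, is illustrative, not from the slide):
import math

def stdev_by_key(records):
    # records: list of dicts such as {"key": "a", "values": [1, 2, 3, 4]}
    result = {}
    for record in records:
        values = record["values"]
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        result[record["key"]] = math.sqrt(variance)
    return result

print(stdev_by_key([{"key": "a", "values": [1, 2, 3, 4]}]))   # {'a': 1.118...}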
Q7. Given a list of stock prices in ascending order by datetime, write a function that outputs the max
profit by buying and selling at a specific interval.
Example:
stock_prices = [10,5,20,32,25,12]
get_max_profit(stock_prices) -> 27
Making it harder, given a list of stock prices and date times in ascending order by datetime, write a function
that outputs the profit and start and end dates to buy and sell for max profit.
stock_prices = [10,5,20,32,25,12]
dts = [
'2019-01-01',
'2019-01-02',
'2019-01-03',
'2019-01-04',
'2019-01-05',
'2019-01-06',
]
get_profit_dates(stock_prices,dts) -> (27, '2019-01-02', '2019-01-04')
Hint: There are several ways to solve this problem. But a good place to start is by thinking about your goal: If
we want to maximize profit, ideally we would want to buy at the lowest price and sell at the highest possible
price.
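A single-pass sketch for both variants (using the function names from the examples above):
def get_max_profit(stock_prices):
    # Track the lowest price seen so far and the best profit achievable
    min_price = stock_prices[0]
    max_profit = 0
    for price in stock_prices[1:]:
        max_profit = max(max_profit, price - min_price)
        min_price = min(min_price, price)
    return max_profit

def get_profit_dates(stock_prices, dts):
    min_price, min_idx = stock_prices[0], 0
    max_profit, buy_idx, sell_idx = 0, 0, 0
    for i in range(1, len(stock_prices)):
        if stock_prices[i] - min_price > max_profit:
            max_profit = stock_prices[i] - min_price
            buy_idx, sell_idx = min_idx, i
        if stock_prices[i] < min_price:
            min_price, min_idx = stock_prices[i], i
    return max_profit, dts[buy_idx], dts[sell_idx]

print(get_max_profit([10, 5, 20, 32, 25, 12]))   # 27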
Python Probability Questions
Most Python questions that involve probability are testing your knowledge of the probability concept. These
questions are really similar to the Python statistics questions except they are focused on simulating concepts
like Binomial or Bayes theorem.
Since most general probability questions are focused around calculating chances based on a certain
condition, almost all of these probability questions can be proven by writing Python to simulate the case
problem.
Q8. Amy and Brad take turns rolling a fair six-sided die. Whoever rolls a "6" first wins the game. Amy starts by
rolling first.
What’s the probability that Amy wins?
Given this scenario, we can write a Python function that can simulate this scenario thousands of times to
see how many times Amy wins first. Solving this problem then requires understanding how to create two
separate people and simulate the scenario of one person rolling first each time.
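A simple Monte Carlo sketch of the game (the function name is illustrative):
import random

def simulate_amy_wins(trials=100_000):
    amy_wins = 0
    for _ in range(trials):
        while True:
            if random.randint(1, 6) == 6:   # Amy rolls first
                amy_wins += 1
                break
            if random.randint(1, 6) == 6:   # then Brad rolls
                break
    return amy_wins / trials

print(simulate_amy_wins())   # roughly 0.545, matching the analytical answer 6/11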
Name mutable and immutable objects.
The mutability of a data structure is the ability to change the portion of the data structure without having to
recreate it. Mutable objects are lists, sets, values in a dictionary.
Immutability is the state of the data structure that cannot be changed after its creation. Immutable objects are
integers, strings, float, bool, tuples, keys of a dictionary.
What are compound data types and data structures?
Data types that are constructed from simple, primitive, basic data types are compound data types. Data
structures in Python allow us to store multiple observations. These are lists, tuples, sets, and dictionaries.
List:
Lists are enclosed within square brackets [].
Lists are mutable, that is, their elements and size
can be changed.
Lists are slower than tuples.
Example: ['A', 1, 'i']
Tuple:
Tuples are enclosed in parentheses ().
Tuples are immutable, i.e. they cannot be edited.
Tuples are faster than lists.
Tuples must be used when the order of the
elements of a sequence matters.
Example: ('Twenty', 20, 'XX')
What is the difference between a list and a tuple?
List: enclosed within square brackets []; mutable, so their elements and size can be changed; slower than tuples.
Example: ['A', 1, 'i']
Tuple: enclosed in parentheses (); immutable, i.e. they cannot be edited; faster than lists.
Example: ('Twenty', 20, 'XX')
What is the difference between 'is' and '==' ?
'==' checks for equality of the values of the variables, and
'is' checks for the identity of the variables (whether they refer to the same object).
What is the difference between indexing and slicing?
Indexing is extracting or lookup one or particular
values in a data structure, whereas slicing retrieves a
sequence of elements
What is the lambda function?
Lambda functions are anonymous, or nameless, functions.
These functions are called anonymous because they are not
declared in the standard manner using the def keyword. A
lambda does not require the return keyword either; the return
value is implicit in the expression.
The function can have any number of parameters but can have
just one statement, and it returns just one value in the form of an
expression. Lambda functions cannot contain commands or
multiple expressions.
An anonymous function cannot be a direct call to print, because
a lambda requires an expression.
Lambda functions have their own local namespace and cannot
access variables other than those in their parameter list and
those in the global namespace.
Example: x = lambda i,j: i+j
print(x(7,8))
Output: 15
What is a default value?
A default argument means the function will use the
default parameter value if the caller has not passed a
value for that parameter.
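For instance, a minimal sketch (the greet function is illustrative):
def greet(name="World"):
    # "World" is the default value, used when no argument is passed
    return "Hello, " + name

print(greet())          # Hello, World
print(greet("Amy"))     # Hello, Amy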
What is the difference between lists and arrays?
An array is a data structure that contains a group of
elements where the elements are of the same data type,
e.g., integer, string. The array elements share the same
variable name, but each element has its own unique
index number or key. The purpose is to organize the data
so that the related set of values can be easily sorted or
searched.
How do I prepare for a Python interview?
There is no one way to prepare for the Python interview.
Knowing the basics can never be discounted. It is
necessary to know at least the following topics for
python interview questions for data science:
You must have command over the basic control flow that is
for loops, while loops, if-else-elif statements. You must know
how to write all these by hand.
A solid foundation of the various data types and data
structures that Python offers. How, where, when and why to
use each of the strings, lists, tuples, dictionaries, sets. It is
must to know how to iterate over each of these.
You must know how to use a list comprehension, dictionary
comprehension, and how to write a function. Use of lambda
functions especially with map, reduce and filter.
If needed you must be able to discuss how you have used
Python, where all have you used to solve common problems
such as Fibonacci series, generate Armstrong numbers.
Be thorough with Pandas, its various functions. Also, be well
versed with various libraries for visualization, scientific &
computational purposes, and Machine Learning.
Machine Learning Basics:
Data Scientist- Job Material
What is Machine Learning?
Machine Learning is making the computer learn from studying data and statistics.
Machine Learning is a step into the direction of artificial intelligence (AI).
Machine Learning is a program that analyses data and learns to predict the outcome
Data Set
In the mind of a computer, a data set is any collection of data. It can be anything from an array to a complete database.
Example of an array:
[99,86,87,88,111,86,103,87,94,78,77,85,86]
Numerical data are numbers, and can be split into two numerical categories:
Discrete Data
- numbers that are limited to integers. Example: The number of cars passing by.
Continuous Data
- numbers that are of infinite value. Example: The price of an item, or the size of an item
What are Mean, Median, and Mode?
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
Mean - The average value.
To calculate the mean, find the sum of all values, and divide the sum by
the number of values:
Median
The median value is the value in the middle, after you have sorted all
the values:
Mode
The Mode value is the value that appears the most number of times:
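A quick sketch using NumPy and SciPy on the same speed list:
import numpy
from scipy import stats

speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
print(numpy.mean(speed))     # 89.77 (approximately)
print(numpy.median(speed))   # 87.0
print(stats.mode(speed))     # the mode is 86 (it appears three times)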
What is Standard Deviation?
Standard deviation is a number that describes how spread out the values are.
A low standard deviation means that most of the numbers are close to the mean (average) value.
A high standard deviation means that the values are spread out over a wider range.
This time we have registered the speed of 7 cars:
speed = [86,87,88,86,87,85,86]
The standard deviation is:
0.9
Meaning that most of the values are within the range of 0.9 from the mean value, which is 86.4.
import numpy
speed = [86,87,88,86,87,85,86]
x = numpy.std(speed)
print(x)
Variance
Variance is another number that indicates how spread out the values are.
In fact, if you take the square root of the variance, you get the standard deviation!
Or the other way around, if you multiply the standard deviation by itself, you get the variance!
To calculate the variance you have to do as follows:
1. Find the mean: (32+111+138+28+59+77+97) / 7 = 77.4
2. For each value, find the difference from the mean and square it.
3. The variance is the average of these squared differences; for this list it works out to 1432.25.
Standard Deviation
As we have learned, the formula to find the standard deviation is the square root of the variance:
√1432.25 = 37.85
import numpy
speed = [32,111,138,28,59,77,97]
x = numpy.std(speed)
print(x)
What are Percentiles?
Percentiles are used in statistics to give you a number that describes the value that a given percent of the values are lower than.
Let's say we have an array of the ages of all the people that live on a street.
What is the age that 90% of the people are younger than?
import numpy
ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
x = numpy.percentile(ages, 90)
print(x)
Machine Learning QnA:
Data Scientist- Job Material
What is Machine Learning?
Machine Learning (ML) is the field of computer science that helps computer systems make sense of
data in much the same way as human beings do.
In simple words, ML is a type of artificial intelligence that extracts patterns out of raw data by using an algorithm or
method. The main focus of ML is to allow computer systems to learn from experience without being explicitly
programmed and without human intervention.
What are Different Types of Machine Learning algorithms?
There are various types of machine learning algorithms. Here is the list of them in a broad category based on:
Whether they are trained with human supervision (Supervised, unsupervised, reinforcement learning)
What is Supervised Learning?
Supervised learning is a machine learning algorithm of inferring a function from labeled training data. The training data
consists of a set of training examples.
Example: 01
Knowing the height and weight of a person, identify their gender. Below are the popular supervised learning
algorithms.
• Support Vector Machines
• Regression
• Naive Bayes
• Decision Trees
• K-nearest Neighbour Algorithm and Neural Networks.
Example: 02
If you build a T-shirt classifier, the labels will be "this is an S, this is an M and this is an L", based on showing the classifier
examples of S, M, and L.
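A hedged scikit-learn sketch of supervised classification (the tiny height/weight training set is made up for illustration):
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: [height_cm, weight_kg] -> gender label (illustrative only)
X = [[170, 65], [180, 80], [160, 50], [175, 75], [155, 48], [185, 90]]
y = ["F", "M", "F", "M", "F", "M"]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)
print(model.predict([[172, 70]]))   # predicted label for a new person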
What is Unsupervised Learning?
Unsupervised learning is also a type of machine learning algorithm, used to find patterns in a given set of data. In this case,
we don't have any dependent variable or label to predict. Unsupervised learning algorithms include:
Clustering,
Anomaly Detection,
Neural Networks and Latent Variable Models.
Example:
In the same T-shirt example, clustering would group the shirts by attributes such as "collar style and V neck style", "crew neck
style" and "sleeve types".
What is PCA? When do you use it?
Principal component analysis (PCA) is most commonly used for dimensionality reduction.
PCA measures the variation along each direction in the data. Directions with little variation carry little information,
so PCA drops them and keeps only the components that explain most of the variance.
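A minimal scikit-learn sketch (the synthetic matrix X stands in for real data):
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 5)             # 100 rows, 5 features (synthetic)
pca = PCA(n_components=2)              # keep the 2 directions with the most variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # share of variance kept by each component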
What is Cross-Validation?
Cross-validation is a method of estimating how well a model
generalizes by repeatedly splitting the data into training and testing parts.
The data is split into k subsets, and the model is trained
on k-1 of those subsets.
The remaining subset is held out for testing. This is done for each
of the subsets. This is k-fold cross-validation. Finally,
the scores from all the k folds are averaged to produce
the final score.
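For example, k-fold cross-validation with scikit-learn (a sketch using the bundled iris dataset):
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)   # 5-fold cross-validation
print(scores.mean())   # average score across the 5 folds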
What is Bias in Machine Learning?
Bias in data tells us there is inconsistency in the data. The inconsistency may occur for several reasons, which are not mutually
exclusive.
For example, a tech giant like Amazon, wanting to speed up its hiring process, built an engine that takes in 100
resumes and spits out the top five candidates to hire.
When the company realized the software was not producing gender-neutral results, it was tweaked to remove this bias.
Explain the Difference Between Classification and Regression?
Classification is used to produce discrete results; it is used to classify data into specific categories.
For example, classifying emails into spam and non-spam categories.
Whereas, regression deals with continuous data.
For example, predicting stock prices at a certain point in time.
Classification is used to predict the output into a group of classes.
For example, Is it Hot or Cold tomorrow?
Whereas, regression is used to predict the relationship that data represents.
For example, What is the temperature tomorrow?
How to Tackle Overfitting and Underfitting?
Overfitting means the model fits the training data too well; in this case, we need to resample the data and estimate the model
accuracy using techniques like k-fold cross-validation. In the underfitting case, the model is not able to understand or capture
the patterns in the data; here, we need to change the algorithm or feed more data points to the model.
What is a Neural Network?
It is a simplified model of the human brain. Much like the brain, it has neurons that activate when encountering something
similar. The different neurons are connected via connections that help information flow from one neuron to another.
How to Handle Outlier Values?
An Outlier is an observation in the dataset that is far away from other observations in the dataset. Tools used to discover
outliers are:
Box plot
Z-score
Scatter plot, etc.
Typically, we need to follow three simple strategies to handle outliers:
We can drop them. We can mark them as outliers and include them as a feature. Likewise, we can transform the feature to
reduce the effect of the outlier.
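A small z-score sketch for flagging outliers (the data and the threshold of 2 are illustrative; 2-3 is a common rule of thumb):
import numpy as np

data = np.array([10, 12, 11, 13, 12, 95, 11, 10])
z_scores = (data - data.mean()) / data.std()
outliers = data[np.abs(z_scores) > 2]
print(outliers)   # [95]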
What is a Random Forest? How does it work?
Random forest is a versatile machine learning method capable of performing both regression and classification tasks.
Like bagging and boosting, random forest works by combining a set of other tree models. Random forest builds each tree from a
random sample of the rows of the training data, considering a random subset of the columns at each split.
Here are the steps a random forest uses to create the trees (a short sketch follows the steps):
o Take a sample from the training data.
o Begin with a single node.
o Run the following algorithm, from the start node:
  a. If the number of observations is less than the node size, then stop.
  b. Select random variables.
  c. Find the variable that does the "best" job of splitting the observations.
  d. Split the observations into two nodes.
  e. Call step a on each of these nodes.
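A hedged scikit-learn sketch (using the bundled iris dataset purely for illustration):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)
print(model.predict(X[:3]))   # predictions for the first three rows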
What is Clustering?
Clustering is the process of grouping a set of objects into a number of groups. Objects should be similar to one another within
the same cluster and dissimilar to those in other clusters.
A few types of clustering are:
Hierarchical clustering, K means clustering, Density-based clustering, Fuzzy clustering, etc.
How can you select K for K-means Clustering?
There are two kinds of methods: direct methods and statistical testing methods:
o Direct methods: the elbow and silhouette methods.
o Statistical testing methods: for example, the gap statistic.
The silhouette is the most frequently used while determining the optimal value of k.
What are Recommender Systems?
A recommendation engine is a system used to predict users’ interests and recommend products that are quite likely interesting
for them.
Data required for recommender systems stems from explicit user ratings after watching a film or listening to a song, from
implicit search engine queries and purchase histories, or from other knowledge about the users/items themselves
Explain Correlation and Covariance?
Correlation is used for measuring and also for estimating the quantitative relationship between two variables. Correlation
measures how strongly two variables are related. Examples like, income and expenditure, demand and supply, etc.
Covariance is a simple way to measure the relationship between two variables. The problem with covariance values is that they
are hard to compare without normalization.
What is P-value?
P-values are used to make a decision about a hypothesis test. P-value is the minimum significant level at which you can reject
the null hypothesis. The lower the p-value, the more likely you reject the null hypothesis.
What are Parametric and Non-Parametric Models?
Parametric models have a limited, fixed number of parameters, so to predict new data you only need to know the parameters of the model.
Non-parametric models do not assume a fixed number of parameters, which allows for more flexibility; to predict new data,
you need to know the current state of the data as well as the model parameters.
What is Reinforcement Learning?
Reinforcement learning is different from the other types of learning like supervised and unsupervised. In reinforcement learning,
we are given neither data nor labels. Our learning is based on the rewards given to the agent by the environment.
Numpy and Pandas QnA:
Data Scientist- Job Material
Define Python Pandas.
Pandas refers to a software library explicitly written for Python, which is used to analyze and manipulate data.
Pandas can be installed using pip or the Anaconda distribution. Pandas makes it very easy to perform machine learning operations
on tabular data.
What Are The Different Types Of Data Structures In Pandas?
The pandas library supports two major types of data structures, DataFrames and Series. Both of these data structures are built on
top of NumPy. A Series is one-dimensional and is the simplest data structure, while a DataFrame is two-dimensional. Another axis-labelled
structure, the "Panel", was a 3-dimensional data structure with items such as major_axis and minor_axis; it has been removed in recent versions of pandas.
Explain Series In Pandas.
Series is a one-dimensional array that can hold data values of any type (string, float, integer,
python objects, etc.). It is the simplest type of data structure in Pandas; here, the data’s axis
labels are called the index.
Define Dataframe In Pandas.
A DataFrame is a 2-dimensional array in which data is aligned in a tabular form with rows and
columns. With this structure, you can perform an arithmetic operation on rows and columns.
How Can You Create An Empty Dataframe In Pandas?
To create an empty DataFrame in Pandas, type
import pandas as pd
ab = pd.DataFrame()
What Are The Most Important Features Of The Pandas Library?
Important features of the pandas library are:
• Data Alignment
• Merge and join
• Memory Efficient
• Time series
• Reshaping
What are the different ways of creating DataFrame in pandas?
Explain with examples.
DataFrame can be created using Lists or Dict of nd arrays.
How Will You Explain Re-indexing In Pandas?
To re-index means to modify the data to match a
particular set of labels along a particular axis.
Various operations can be achieved using indexing,
such as-
Insert missing value (NA) markers in label locations
where no data for the label existed.
Reorder the existing set of data to match a new set of
labels.
Create A Series Using Dict In Pandas.
import pandas as pd
ser = {'a': 1, 'b': 2, 'c': 3}
ans = pd.Series(ser)
print(ans)
What are the different ways of creating DataFrame in pandas? Explain with
examples.
DataFrame can be created using Lists or Dict of nd arrays.
Example 1 – Creating a DataFrame using a list
import pandas as pd
# a list of strings
str_list = ['Pandas', 'NumPy']
# Calling the DataFrame constructor on the list
df = pd.DataFrame(str_list)
print(df)
Example 2 – Creating a DataFrame using a dict of lists
import pandas as pd
data = {'ID': [1001, 1002, 1003], 'Department': ['Science', 'Commerce', 'Arts']}
df = pd.DataFrame(data)
print(df)
How Can You Iterate Over Dataframe In Pandas?
To iterate over a DataFrame in pandas, a for loop can
be used in combination with an iterrows() call.
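For example (a sketch with a made-up DataFrame):
import pandas as pd

df = pd.DataFrame({"name": ["Amy", "Brad"], "score": [90, 85]})
for index, row in df.iterrows():
    print(index, row["name"], row["score"])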
What Is A NumPy Array?
Numerical Python (NumPy) is a widely used Python
package for numerical
computations and processing of multidimensional
and single-dimensional array elements.
NumPy arrays calculate faster than
other Python sequences such as lists.
How Can A Dataframe Be Converted To An Excel
File?
To convert a single object to an excel file, we can
simply specify the target file’s name. However, to
convert multiple sheets, we need to create an
ExcelWriter object along with the target filename
and specify the sheet we wish to export.
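A sketch of both cases (file and sheet names are illustrative; writing .xlsx files typically requires an engine such as openpyxl):
import pandas as pd

df = pd.DataFrame({"ID": [1, 2], "Department": ["Science", "Arts"]})

# Single DataFrame: just give the target file name
df.to_excel("report.xlsx", index=False)

# Multiple sheets: use an ExcelWriter object
with pd.ExcelWriter("report_multi.xlsx") as writer:
    df.to_excel(writer, sheet_name="Sheet1", index=False)
    df.to_excel(writer, sheet_name="Sheet2", index=False)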
What is the main use of NumPy?
NumPy can be used to perform a wide variety of mathematical operations on arrays. It adds powerful data structures to
Python that guarantee efficient calculations with arrays and matrices and it supplies an enormous library of high-level
mathematical functions that operate on these arrays and matrices.
Why is NumPy faster than a list?
NumPy arrays are faster than Python lists for the following reasons: an array is a collection of homogeneous data
types stored in contiguous memory locations, whereas a list in Python is a collection of heterogeneous
data types stored in non-contiguous memory locations.
What is the difference between NumPy and Pandas?
NumPy provides objects for multi-dimensional arrays, whereas Pandas offers an in-memory 2D table object
called a DataFrame. NumPy consumes less memory as compared to Pandas, while Pandas adds labelled
indexing (for example, the index of a Series object).
List the steps to create a 1D array and a 2D array
A one-dimensional array is created as follows:
import numpy as np
num = [1, 2, 3]
num = np.array(num)
print("1d array : ", num)
A two-dimensional array is created as follows:
num2 = [[1, 2, 3], [4, 5, 6]]
num2 = np.array(num2)
print("\n2d array : ", num2)
How do you create a 3D array?
A three-dimensional array is created as follows:
num3 = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
num3 = np.array(num3)
print("\n3d array : ", num3)
What is the procedure to find the indices of an array in NumPy
where some condition is true?
You may use the function numpy.nonzero() to find the indices of an
array. You can also use the nonzero() method of the boolean array to do so.
In the following program, we take an array a, where the
condition is a > 3. The comparison returns a boolean array. False in
Python and NumPy is treated as 0. Therefore, np.nonzero(a > 3)
will return the indices of the array a where the condition is True.
>>> import numpy as np
>>> a = np.array([[1,2,3],[4,5,6],[7,8,9]])
>>> a > 3
array([[False, False, False],
[ True, True, True],
[ True, True, True]], dtype=bool)
>>> np.nonzero(a > 3)
(array([1, 1, 1, 2, 2, 2]), array([0, 1, 2, 0, 1, 2]))
You can also call the nonzero() method of the boolean array.
>>> (a > 3).nonzero()
(array([1, 1, 1, 2, 2, 2]), array([0, 1, 2, 0, 1, 2]))
Data Visualization
Data Scientist- Job Material
What is the difference between various BI tools and Tableau?
The basic difference between the traditional BI tools and Tableau lies in the efficiency and speed.
The architecture of traditional BI tools has hardware limitations, while Tableau does not have such dependencies.
The traditional BI tools work on complex technologies while Tableau uses a simple associative search to make it dynamic.
Traditional BI tools do not support multi-thread, in-memory, or multi-core computing while Tableau supports all these
features after integrating complex technologies.
Traditional BI tools have a pre-defined data view while Tableau does a predictive analysis for business operations.
What are different Tableau products?
Tableau like other BI tools has a range of products:
Tableau Desktop: The Desktop product is used to build visualizations and the optimized queries behind them. Once the queries are ready,
you can run those queries without the need to code. Tableau Desktop brings data from various sources into its
data engine and creates interactive dashboards.
Tableau Server: When you have published dashboards using Tableau Desktop, Tableau servers help in sharing them
throughout the organization. It is an enterprise-level feature that is installed on a Windows or Linux server.
Tableau Reader: Tableau Reader is a free tool that lets you open and view data visualizations. You
can filter or drill down into the data, but it restricts editing any formulas or performing any other kind of action on it. It is also used to
open extract (connection) files.
Tableau Online: Tableau online is also a paid feature but doesn’t need exclusive installation. It comes with the software and
is used to share the published dashboards anywhere and everywhere.
Tableau Public: Tableau public is yet another free feature to view your data visualizations by saving them as worksheets or
workbooks on Tableau Server.
What is a parameter in Tableau?
The parameter is a variable (numbers, strings, or date) created to replace a constant value in calculations, filters, or
reference lines. For example, you create a field that returns true if the sales are greater than 30,000 and false if otherwise.
Parameters are used to replace these numbers (30000 in this case) to dynamically set this during calculations. Parameters
allow you to dynamically modify values in a calculation. The parameters can accept values in the following options:
All: Simple text field
List: List of possible values to select from
Range: Select values from a specified range
Tell me something about measures and dimensions?
In Tableau, when we connect to a new data source, each field in the data source is either mapped
as measures or dimensions. These fields are the columns defined in the data source. Each field is assigned a dataType
(integer, string, etc.) and a role (discrete dimension or continuous measure).
Measures contain numeric values that are analyzed by a dimension table. Measures are stored in a table that allows storage
of multiple records and contains foreign keys referring uniquely to the associated dimension tables.
While Dimensions contain qualitative values (name, dates, geographical data) to define comprehensive attributes to
categorize, segment, and reveal the data details.
What are continuous and discrete field types?
Tableau’s specialty lies in displaying data differently either in continuous format or discrete. Both of them are mathematical
terms used to define data, where continuous means without interruptions and discrete means individually separate and
distinct.
While the blue color indicates discrete behavior, the green color indicates continuous behavior. On one hand, the discrete
view defines the headers and can be easily sorted, while continuous defines the axis in a graph view and cannot be sorted.
What is aggregation and disaggregation of data?
Aggregation of data means displaying the measures and dimensions in an aggregated form. The aggregate functions available
in the Tableau tool are:
SUM (expression): Adds up all the values used in the expression. Used only for numeric values.
AVG (expression): Calculates the average of all the values used in the expression. Used only for numeric values.
Median (expression): Calculates the median of all the values across all the records used in the expression. Used only for
numeric values.
Count (expression): Returns the number of values in the set of expressions. Excludes null values.
Count (distinct): Returns the number of unique values in the set of expressions.
Disaggregation of data means displaying each and every data field separately
Tell me the different connections to make with a dataset?
There are two types of data connections in Tableau:
LIVE: Live connection is a dynamic way to extract real-time data by directly connecting to the data source. Tableau directly
creates queries against the database entries and retrieves the query results in a workbook.
EXTRACT: A snapshot of the data, extract the file (.tde or .hyper file) contains data from a relational database. The data is
extracted from a static source of data like an Excel Spreadsheet. You can schedule to refresh the snapshots which are done
using the Tableau server. This doesn’t need any connection with the database.
What are the different types of joins in Tableau?
Tableau is pretty similar to SQL. Therefore, the types of joins in Tableau are similar:
• Left Outer Join: Extracts all the records from the left table and the matching rows from the right table.
• Right Outer Join: Extracts all the records from the right table and the matching rows from the left table.
• Full Outer Join: Extracts the records from both the left and right tables. All unmatched rows go with the NULL value.
• Inner Join: Extracts only the records that match in both the left and right tables.
Django
Data Scientist- Job Material
What are Django's most prominent features?
Programmers like Django mostly for its convenient features like:
• Optimized for SEO
• Extremely fast
• A loaded framework that features authentications, content administrations and RSS feeds
• Exceptionally scalable to meet the heaviest traffic demand
• Highly secure
• Versatility, enabling you to create many different types of websites
Can you name some companies that use Django?
Some of the more well-known companies that use Django include:
• Disqus
• Instagram
• Mozilla Firefox
• Pinterest
• Reddit
• YouTube
Why do web developers prefer Django?
Web developers use Django because it:
• Allows code modules to be divided into logical groups, making them flexible to change
• Provides an auto-generated web admin module to ease website administration
• Provides a pre-packaged API for common user tasks
• Enables developers to define a given function’s URL
• Allows users to separate business logic from the HTML
• Is written in Python, one of the most popular programming languages available today
• Gives you a system to define the HTML template for your web page, avoiding code duplication
What is CRUD?
It has nothing to do with dirt or grime. It’s a handy acronym for Create, Read, Update, and Delete. It’s a mnemonic
framework used to remind developers on how to construct usable models when building application programming
interfaces (APIs).
What does Django architecture look like?
Django architecture consists of:
• Models. Describes the database schema and data structure
• Views. Controls what a user sees. The view retrieves data from appropriate models, executes any calculations
made, and passes it on to the template
• Templates. Controls how the user sees the pages. It describes how the data received from the views need to be
altered or formatted to display on the page
• Controller. Made up of the Django framework and URL parsing
In Django’s context, what’s the difference between a project and an app?
The project covers the entire application, while an app is a module or application within the project that deals with
one dedicated requirement. So, a project consists of several apps, while an app features in multiple projects.
What’s a model in Django?
A model consists of all the necessary fields and attributes of your stored data. They are a single, definitive source of
information regarding your data.
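For instance, a minimal (illustrative) model:
from django.db import models

class Student(models.Model):
    # Each attribute becomes a column in the database table
    name = models.CharField(max_length=100)
    enrolled_on = models.DateField(auto_now_add=True)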
What are Django’s templates?
Django templates render information in a designer-friendly format to present to the user. Using the Django Template
Language (DTL), a user can generate HTML dynamically. Django templates consist of simple text files that can create
any text-based format such as XML, CSV, and HTML.
Discuss Django’s Request/Response Cycle.
Starting the process off, the Django server receives a request. The server then looks for a matching URL in the URL
patterns defined for the project. If the server can’t find a matching URL, it produces a 404-status code. If the URL
matches, it executes the corresponding code in the view file associated with the URL and sends a response.
What is the Django Admin interface?
Django comes equipped with a fully customizable, built-in admin interface. This portal lets developers see and make
changes to all the data residing in the database that contains registered apps and models. The model must be
registered in the admin.py file to use a database table with the admin interface.
How do you install Django?
Users download and install Python per the operating system used by the host machine. Then run the command pip
install "django>=2.2,<3" on the terminal and wait for the installation to finish.
How do you check which version of Django that you have installed on your system?
You can check the version by opening the command prompt and entering the command:
python -m django --version
What are signals in Django?
Signals are pieces of code containing information about what is currently going on. A dispatcher is used to both send
and listen for signals.
What is the Django Rest Framework?
The Django Rest Framework (DRF) is a framework that helps you quickly create RESTful APIs. They are ideal for web
applications due to low bandwidth utilization.
What do you use middleware for in Django?
You use middleware for four different functions:
• Content Gzipping
• Cross-site request forgery protection
• Session management
• User authentication
What does a URLs-config file contain?
The URLs-config file in Django contains a list of URL patterns and the mappings from those URLs to view functions. The URLs can
map to view functions, class-based views, and the URLs-config of other applications.
Does Django support multiple-column primary keys?
No, Django supports only single-column primary keys.
Data Science Mock Interview
Data Scientist- Job Material
What is the difference between data science and big data?
What are Recommender Systems?
Ans. Recommender systems are a subclass of information filtering systems, used to predict how users would rate or score
particular objects (movies, music, merchandise, etc.). Recommender systems filter large volumes of information based on
the data provided by a user and other factors, and they take care of the user’s preference and interest.
Recommender systems utilize algorithms that optimize the analysis of the data to build the recommendations. They ensure
a high level of efficiency as they can associate elements of our consumption profiles such as purchase history, content
selection, and even our hours of activity, to make accurate recommendations
What are the different types of Recommender Systems?
Ans. There are three main types of Recommender systems:
User-User collaborative filtering
Item-Item collaborative filtering
Content-based filtering
Differentiate between wide and long data formats.
In the wide format, each subject's repeated measurements sit in a single row, with each variable in its own column.
In the long format, each row is a single observation, so a subject appears in many rows, one per variable or time point.
What are Interpolation and Extrapolation?
Interpolation – This is the method of estimating data points between known data points. It is a prediction within the range of the given data points.
Extrapolation – This is the method of estimating data points beyond the given data set. It is a prediction outside the range of the given data points.
How much data is enough to get a valid outcome?
Ans. All the businesses are different and measured in different ways. Thus, you never have enough data and there will be no
right answer. The amount of data required depends on the methods you use to have an excellent chance of obtaining vital
results.
What is the difference between ā€˜expected value’ and ā€˜average value’?
Ans. When it comes to functionality, there is no difference between the two. However, they are used in different situations.
An expected value usually reflects random variables, while the average value reflects the population sample.
What happens if two users access the same HDFS file at the same time?
Ans. This is a bit of a tricky question. The answer itself is not complicated, but it is easy to confuse by the similarity of
programs’ reactions. When the first user is accessing the file, the second user’s inputs will be rejected because HDFS
NameNode supports exclusive write.
What is the importance of statistics in data science?
Ans. Statistics help data scientists to get a better idea of a customer’s expectations. Using statistical methods, data Scientists
can acquire knowledge about consumer interest, behavior, engagement, retention, etc. It also helps to build robust data
models to validate certain inferences and predictions.
What are the different statistical techniques used in data science?
There are many statistical techniques used in data science, including:
• The arithmetic mean – It is a measure of the average of a set of data
• Graphic display – Includes charts and graphs to visually display, analyze, clarify, and interpret numerical data through
histograms, pie charts, bars, etc.
• Correlation – Establishes and measures relationships between different variables
• Regression – Allows identifying if the evolution of one variable affects others
• Time series – It predicts future values by analyzing sequences of past values
• Data mining and other Big Data techniques to process large volumes of data
• Sentiment analysis – It determines the attitude of specific agents or people towards an issue, often using data from social
networks
• Semantic analysis – It helps to extract knowledge from large amounts of texts
• A/B testing – To determine which of two variants works best, using randomized experiments
• Machine learning using automatic learning algorithms to ensure excellent performance in the presence of big data
What is an RDBMS? Name some examples for RDBMS?
A relational database management system (RDBMS) is a database management system that is based on a relational model.
Some examples of RDBMS are MS SQL Server, IBM DB2, Oracle, MySQL, and Microsoft Access.
What are a Z test, Chi-Square test, F test, and T-test?
The Z-test is applied for large samples. Z = (sample mean − population mean) / (population standard deviation / √n).
The Chi-Square test is a statistical method assessing the goodness of fit between a set of observed values and those expected
theoretically.
The F-test is used to compare the variances of two populations. F = explained variance / unexplained variance.
The T-test is applied for small samples. t = (sample mean − population mean) / (sample standard deviation / √n).
What does P-value signify about the statistical data?
The p-value is the probability for a given statistical model that, when the null hypothesis is true, the statistical summary would
be the same as or more extreme than the actual observed results.
When:
P-value > 0.05 denotes weak evidence against the null hypothesis, which means the null hypothesis cannot be rejected.
P-value <= 0.05 denotes strong evidence against the null hypothesis, which means the null hypothesis can be rejected.
P-value = 0.05 is the marginal value, indicating it is possible to go either way.
Differentiate between univariate, bivariate, and multivariate analysis.
Univariate analysis is the simplest form of statistical analysis where only one variable is involved.
Bivariate analysis is where two variables are analyzed and in multivariate analysis, multiple variables are examined.
What is association analysis? Where is it used?
Association analysis is the task of uncovering relationships among data. It is used to understand how the data items are
associated with each other.
What is the difference between squared error and absolute error?
Squared error measures the average of the squares of the errors or deviations—that is, the difference between the estimator
and what is estimated. Absolute error is the difference between the measured or inferred value of a quantity and its actual
value.
What is an API? What are APIs used for?
API stands for Application Program Interface and is a set of routines, protocols, and tools for building software applications.
With API, it is easier to develop software applications.
What is Collaborative filtering?
Collaborative filtering is a method of making automatic predictions by using the recommendations of other people.
What is market basket analysis?
Market Basket Analysis is a modeling technique based upon the theory that if you buy a certain group of items, you are more (or
less) likely to buy another group of items.
What is the central limit theorem?
The central limit theorem states that the distribution of an average will tend to be Normal as the sample size increases,
regardless of the distribution from which the average is taken except when the moments of the parent distribution do not exist.
Explain the difference between type I and type II errors.
Type I error is the rejection of a true null hypothesis or false-positive finding, while Type II error is the non-rejection of a false
null hypothesis or false-negative finding.
What is Linear Regression?
It is one of the most commonly asked data science interview questions.
Linear regression is the most popular type of predictive analysis. It is used to model the relationship between a scalar response
and explanatory variables.
What are the limitations of a Linear Model/Regression?
Linear models can capture only linear relationships between the dependent and independent variables
Linear regression looks at the relationship between the mean of the dependent variable and the independent variables, not at
the extremes of the dependent variable
Linear regression is sensitive to univariate and multivariate outliers
Linear regression assumes that the observations are independent of each other
What is the goal of A/B Testing?
Ans. A/B testing is a comparative study, where two or more variants of a page are presented before random users and their
feedback is statistically analyzed to check which variation performs better.
What is the main difference between overfitting and underfitting?
Ans. Overfitting – In overfitting, a statistical model describes random error or noise instead of the underlying relationship, and it occurs when a model is excessively complex. An overfit model has poor predictive performance as it overreacts to minor fluctuations in the training data.
Underfitting – In underfitting, a statistical model is unable to capture the underlying data trend. This type of model also shows
poor predictive performance.
What is a Gaussian distribution and how it is used in data science?
Ans. Gaussian distribution or commonly known as bell curve is a common probability distribution curve.
Explain the purpose of group functions in SQL. Cite certain examples of group functions.
Group functions provide summary statistics of a data set. Some examples of group functions are –
a) COUNT
b) MAX
c) MIN
d) AVG
e) SUM
f) DISTINCT (strictly speaking a keyword rather than a group function, though it is often listed alongside them)
What is the difference between a Validation Set and a Test Set?
The validation set is used to minimize overfitting. This is used in parameter selection, which means that it helps to
verify any accuracy improvement over the training data set. Test Set is used to test and evaluate the performance of a trained
Machine Learning model.
What is the p-value?
Ans. A p-value helps to determine the strength of the results in a hypothesis test. It is a number between 0 and 1, and its value indicates how strong the evidence is against the null hypothesis.
What do you mean by logistic regression?
Also known as the logit model, logistic regression is a technique to predict the binary result from a linear amalgamation of
predictor variables.
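For illustration, here is a minimal, hedged sketch of fitting a logistic regression with scikit-learn on one of its bundled toy datasets:
# Minimal sketch: logistic regression on a built-in binary dataset
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)  # predicts a binary outcome
print("test accuracy:", model.score(X_test, y_test))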
What happens if two users access the same HDFS file at the same time?
This is a bit of a tricky question. The answer itself is not complicated, but it is easy to get confused because similar programs react differently. HDFS NameNode supports exclusive writes, so when the first user is accessing the file, the second user's write request will be rejected.
What is Correlation Analysis?
Ans. Correlation Analysis is a statistical method used to evaluate the strength of the relationship between two quantitative variables. A related technique, spatial autocorrelation, uses autocorrelation coefficients to measure how values are related to one another based on distance.
What is the difference between a bar graph and a histogram?
In bar charts, each column represents a group defined by a categorical
variable; and with histograms, each column represents a group defined by
a quantitative variable
Which technique is used to predict categorical responses?
Classification techniques are used to predict categorical responses.
What libraries do data scientists use to plot data in Python?
Matplotlib is the main library used to plot data in Python. However, graphics created with this library need a lot of tweaking to
make them look bright and professional. For that reason, many data scientists prefer Seaborn, which allows you to create
attractive and meaningful charts with just one line of code.
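As a small, hedged illustration: Seaborn builds on Matplotlib, so a single Seaborn call produces a styled chart (the "tips" data used here is Seaborn's own demo dataset, fetched on first use):
# Minimal sketch: a styled histogram in one Seaborn call (Matplotlib under the hood)
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")          # small demo dataset fetched by Seaborn
sns.histplot(data=tips, x="total_bill")  # one line for an attractive, labelled chart
plt.savefig("total_bill_hist.png")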
What are lambda functions?
Ans. Lambda functions are anonymous functions in Python. They are very useful when you need to define a function that is very
short and consists of a single expression. So instead of formally defining the little function with a specific name, body, and
return statement, you can write everything in a short line of code using a lambda function.
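A minimal sketch of the difference (the names square and prices are purely illustrative):
# Minimal sketch: a lambda vs. an equivalent named function
square = lambda x: x ** 2        # anonymous, single-expression function
def square_named(x):
    return x ** 2

prices = [3, 7, 2]
# lambdas are handy as short inline callbacks, e.g. as a sort key
print(sorted(prices, key=lambda p: -p))   # [7, 3, 2]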
What is Tensorflow?
Google Brain Team developed Tensorflow, which is a free software designed for numerical computation using graphs.
What is PyTorch?
PyTorch is a Python-based scientific computing package designed to perform numerical calculations using the programming of
tensors. PyTorch is designed to seamlessly integrate with Python and its popular libraries like NumPy and is easier to learn than
other Deep Learning frameworks.
What packages are used for data mining in Python and R?
Ans. There are various packages in Python and R:
• Python – Orange, Pandas, NLTK, Matplotlib, and Scikit-learn are some of them.
• R – Arules, tm, Forecast, and GGPlot are some of the packages.
Explain the difference between lists and tuples.
Both lists and tuples are made up of elements, which are values of any Python data type. However, these data types have a number of differences:
• Lists are mutable, while tuples are immutable.
• Lists are created in brackets (for example, my_list = [a, b, c]), while tuples are in parentheses (for example, my_tuple = (a, b, c)).
• Lists are slower than tuples.
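A minimal sketch of the mutability difference:
# Minimal sketch: lists are mutable, tuples are not
my_list = ["a", "b", "c"]
my_tuple = ("a", "b", "c")

my_list[0] = "z"       # fine: lists can be changed in place
try:
    my_tuple[0] = "z"  # raises TypeError because tuples are immutable
except TypeError as err:
    print(err)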
What is hypothesis testing?
Hypothesis testing is a statistical procedure used in Machine Learning and Data Science to determine whether there is enough evidence in a sample of data to support or reject an assumption (hypothesis) about the factors that may affect the outcome of an experiment.
What is Pattern Recognition?
Ans. Pattern recognition is the process of classifying data by detecting regularities and recurring structures in it. This methodology involves the extensive use of machine learning algorithms.
When do you need to update the algorithm in Data science?
Ans. You need to update an algorithm in the following situation:
You want your data model to evolve as data streams in through the infrastructure
The underlying data source is changing
The data is non-stationary
Why should you perform dimensionality reduction before fitting an SVM?
SVMs tend to perform better in a reduced feature space. If the number of features is large compared to the number of observations, we should perform dimensionality reduction before fitting an SVM.
What is the Hierarchical Clustering Algorithm?
Ans. A hierarchical clustering algorithm merges and splits existing groups, creating a hierarchical structure that shows the order in which the groups are split or merged.
Have you contributed to any open source project?
You must say specifically which projects you have worked on and what was their objective. A good answer would also include
what you have learned from participating in open source projects.
What is a decision tree method?
Ans. The Decision Tree method is an analytical method that facilitates better decision making through a schematic representation of the available alternatives. Decision trees are very helpful when there are risks, costs, benefits, and multiple options involved. The name is derived from the model's tree-like appearance, and it is widely used in the field of decision making under uncertainty (Decision Theory).
Why is natural language processing important?
Ans. NLP helps computers communicate with humans in their language and scales other language-related tasks. It contributes
towards structuring a highly unstructured data source.
o Content categorization – Generate a linguistics-based summary of the document, including search and indexing, content
alerts, and duplication detection.
o Sentiment analysis – Identification of mood or subjective opinions in large amounts of text, including sentiment mining and
average opinions.
o Speech-to-text and text-to-speech conversion – Transformation of voice commands into written text and vice versa.
o Document summarization – Automatic generation of synopses of large bodies of text.
o Machine-based translation – Automatic translation of text or speech from one language to another.
What is the importance of the decision tree method?
The decision tree method mitigates the risks of unforeseen consequences and allows you to include smaller details that will
lead you to create a step-by-step plan. Once you choose your path, you only need to follow it. Broadly speaking, this is a perfect
technique for –
o Analyzing problems from different perspectives
o Evaluating all possible solutions
o Estimating the business costs of each decision
o Making reasoned decisions with real and existing information about any company
o Analyzing alternatives and probabilities that result in the success of a business
What is the main difference between supervised and unsupervised machine learning?
Ans. Supervised learning includes training labeled data for a range of tasks such as data classification, while unsupervised
learning does not require explicitly labeling data.
What is data visualization?
Data visualization is the process of presenting datasets and other information through visual mediums like charts, graphs, and
others. It enables the user to detect patterns, trends, and correlations that might otherwise go unnoticed in traditional reports,
tables, or spreadsheets.
What is KNN?
K-Nearest Neighbour or KNN is a simple Machine Learning algorithm based on the Supervised Learning method. It assumes similarity between the new case and the available cases, and assigns the new case to the category it is most similar to.
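A minimal, hedged sketch with scikit-learn's built-in Iris dataset:
# Minimal sketch: KNN classification with scikit-learn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)  # vote among 5 nearest neighbours
print("test accuracy:", knn.score(X_test, y_test))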
What is Gradient Descent?
Gradient Descent is a popular algorithm used for training Machine Learning models and finding the values of parameters of a
function (f), which helps to minimize a cost function.
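A minimal sketch of the idea, minimising the toy cost function f(w) = (w − 3)²:
# Minimal sketch: gradient descent on f(w) = (w - 3)^2
learning_rate = 0.1
w = 0.0
for _ in range(100):
    gradient = 2 * (w - 3)         # derivative of the cost function at w
    w -= learning_rate * gradient  # step in the direction that lowers the cost
print(round(w, 4))                 # converges towards the minimum at w = 3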
Before you appear for your data science interview, make sure you have –
Researched the role and the skills required for it
Brushed up on your learning, read through the concepts, and gone through the projects you have worked on
Participated in mock interviews to prepare yourself better
Reviewed your past experience and achievements and made a gist of them
Projects for Interview
Data Scientist- Job Material
CASE 1:
The best way to find out whether a hotel is right for you or not is to find out what people who have stayed there before are saying about it. However, it is very difficult to read the experience of every person who has given an opinion on the hotel's services. This is where the task of sentiment analysis comes in. Here, I will walk you through the task of Hotel Reviews Sentiment Analysis with Python.
Hotel Reviews Sentiment Analysis with Python
The dataset that I am using for the task of Hotel Reviews sentiment analysis is collected from Kaggle. It contains about 20,000 reviews from people about the services of hotels they stayed in for a vacation, a business trip, or any other kind of trip. The dataset contains only two columns: the customers' Reviews and Ratings. So let's get started with the task of Hotel Reviews sentiment analysis with Python by importing the necessary Python libraries and the dataset:
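The original slide shows the code as a screenshot; here is a minimal, hedged sketch of what the loading step might look like (the file name hotel_reviews.csv and the column names Review and Rating are assumptions, not taken from the slide):
# Hedged sketch: file and column names are assumptions
import pandas as pd

data = pd.read_csv("hotel_reviews.csv")   # assumed columns: Review, Rating
print(data.head())
print(data.isnull().sum())                # confirm there are no missing values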
This dataset is fairly large and, luckily, there are no missing values, so without wasting any time let's take a quick look at the distribution of customer ratings:
According to the reviews, hotel guests seem satisfied with the services, now let’s take a look at how most people think
about hotel services based on the sentiment of their review
Thus, most people feel neutral about the hotel services.
Now let’s take a closer look at sentiment scores:
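Again, the slide shows only a screenshot; a hedged sketch of how the sentiment scores might be computed with NLTK's VADER analyser (the approach and the new column names are assumptions; data comes from the loading step above):
# Hedged sketch: scoring each review with VADER (assumed approach and column names)
import nltk
nltk.download("vader_lexicon")
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
scores = data["Review"].astype(str).apply(sia.polarity_scores)
data["Positive"] = scores.apply(lambda s: s["pos"])
data["Negative"] = scores.apply(lambda s: s["neg"])
data["Neutral"] = scores.apply(lambda s: s["neu"])
print(data[["Positive", "Negative", "Neutral"]].mean())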
Thus, according to the above results, more than 12,000 reviews are
classified as neutral, more than 6,000 reviews are classified as positive.
So it can be said that people are really happy with the services of the
hotels they have stayed in as the negative reviews are below 1500.
Summary
This is how you can analyze the sentiments of hotel reviews. The best way to know whether a hotel is right for you or not is to find out what people who have stayed there before are saying about it, and hotel reviews sentiment analysis can help you decide whether or not a hotel is suitable for your trip. I hope you liked this article on sentiment analysis of hotel reviews with Python.
Projects for Interview
Data Scientist- Job Material
CASE 2:
In this Data Science Project we will create a Linear Regression model and a Decision Tree
Regression Model to Predict Apple’s Stock Price using Machine Learning and Python.
Projects for Interview
Data Scientist- Job Material
CASE 2: Customer Personality Analysis with Python
Customer personality analysis helps a business tailor its product to its target customers across different customer segments. For example, instead of spending money marketing a new product to every customer in the company's database, a company can analyze which customer segment is most likely to buy the product and then market the product only to that particular segment.
The most important part of a customer personality analysis is getting answers to questions such as:
What people say about your product: what is the customers' attitude towards the product?
What people do: what are people actually doing, rather than what they are saying about your product?
I'll walk you through a data science project on analyzing customer personality with Python. Here I will be using a dataset that contains data collected from a marketing campaign, where our task is to predict how different customer segments will respond to a particular product or service.
Now let’s start with the task of customer
personality analysis with Python. Since this is a
segmentation task, we will use clustering to
summarize customer segments and then we will
also use the Apriori algorithm here. Now let’s start
by importing the necessary Python libraries and
the dataset:
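The slide shows the code as a screenshot; a hedged sketch of the loading step (the file name marketing_campaign.csv and the tab separator are assumptions based on the common Kaggle version of this dataset):
# Hedged sketch: file name and separator are assumptions
import pandas as pd

data = pd.read_csv("marketing_campaign.csv", sep="\t")
print(data.head())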
In the code section below, I will first normalize the data and then I will create customer clustering according to the metrics
defined above:
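A hedged sketch of what that step might look like; the metric columns (Income, Recency, and a derived Age) and the number of clusters are assumptions for illustration:
# Hedged sketch: scale assumed metric columns, then cluster with K-Means
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

features = data[["Income", "Recency", "Age"]].dropna()   # column names are assumptions
scaled = StandardScaler().fit_transform(features)        # normalize the data
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(scaled)
features["Cluster"] = kmeans.labels_
print(features["Cluster"].value_counts())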
Data Preparation for Customer Personality Analysis
Now I will prepare the data for the Apriori algorithm. Here I will be defining three segments of the customers according to the
age, income and seniority:
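A hedged sketch of the binning step with pandas; the cut points and labels are illustrative assumptions:
# Hedged sketch: define labelled customer segments (cut points are assumptions)
import pandas as pd

data["AgeGroup"] = pd.cut(data["Age"], bins=[0, 30, 45, 60, 120],
                          labels=["Young", "Adult", "Mature", "Senior"])
data["IncomeGroup"] = pd.qcut(data["Income"], q=3,
                              labels=["Low income", "Mid income", "High income"])
print(data[["AgeGroup", "IncomeGroup"]].head())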
Apriori Algorithm
The Apriori algorithm is a simple technique for identifying underlying relationships between different types of items. The idea behind the algorithm is that every non-empty subset of a frequent itemset must also be frequent. Here I will be using the Apriori algorithm for the task of customer personality analysis with Python, to identify the biggest customers of wine:
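A hedged sketch using the mlxtend implementation of Apriori; it assumes a one-hot encoded frame of segment/product flags called basket has already been built from the data, and the support and lift thresholds are arbitrary examples:
# Hedged sketch: frequent itemsets and association rules with mlxtend
# (`basket` is an assumed one-hot DataFrame of segment/product flags)
from mlxtend.frequent_patterns import apriori, association_rules

frequent_items = apriori(basket, min_support=0.05, use_colnames=True)
rules = association_rules(frequent_items, metric="lift", min_threshold=1.0)
print(rules.sort_values("confidence", ascending=False).head())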
Projects for Interview
Data Scientist- Job Material
CASE 3: Fake Currency Detection with Machine Learning
Fake Currency Detection is a real problem for both individuals and businesses. Counterfeiters are constantly finding new methods and techniques to produce counterfeit banknotes that are essentially indistinguishable from real money, at least to the human eye. In this article, I will introduce you to Fake Currency Detection with Machine Learning.
Fake Currency Detection is a task of binary classification in machine learning. If we have enough data on real and fake
banknotes, we can use that data to train a model that can classify the new banknotes as real or fake.
The dataset I will use in this task for fake currency detection can be downloaded from here. The dataset contains these four
input characteristics:
• The variance of the wavelet-transformed image
• The skewness of the wavelet-transformed image
• The kurtosis of the wavelet-transformed image
• The entropy of the image
The target value is simply 0 for real banknotes and 1 for fake banknotes. Now let’s get started with this task of Fake Currency
Detection with Machine Learning. I will start this task by importing the necessary packages:
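The slide shows the imports as a screenshot; a hedged guess at the kind of packages such a walkthrough relies on (the exact list and the choice of classifier are assumptions):
# Hedged sketch: typical imports for this kind of binary-classification task
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix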
Now, let’s have a look at the dataset, the data does not have headings so I will also assign headings in the process
and then I will print the first 5 rows from the data:
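A hedged sketch of that step; the file name is an assumption, while the column names follow the four features and target described above:
# Hedged sketch: assign headings while loading (file name is an assumption)
columns = ["variance", "skewness", "kurtosis", "entropy", "class"]
data = pd.read_csv("banknotes.csv", names=columns)
print(data.head())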
From this pair plot we can make several
interesting observations:
• The distribution of both variance and
skewness appears to be quite different
for the two target characteristics, while
kurtosis and entropy appear to be more
similar.
• There are clear linear and nonlinear
trends in the input features.
• Some characteristics seem to be
correlated.
• Some features seem to separate genuine
and fake banknotes quite well.
Data Processing
Now we need to balance our data. The easiest way to do this is to randomly drop a number of instances of the over-represented target class; this is called random undersampling. Alternatively, we could create new synthetic data for the under-represented target class, which is called oversampling. For now, let's start by randomly deleting 152 observations of genuine banknotes:
Now we have a perfectly balanced
dataset. Next, we need to divide the data
into training and test sets:
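A hedged sketch of the undersampling and split, assuming the column names from the loading step above and a 70/30 split:
# Hedged sketch: random undersampling of the majority class, then a train/test split
genuine = data[data["class"] == 0]
fake = data[data["class"] == 1]
balanced = pd.concat([genuine.sample(len(fake), random_state=42), fake])  # drops the surplus genuine notes

X = balanced.drop("class", axis=1)
y = balanced["class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)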
Projects for Interview
Data Scientist- Job Material
E-commerce Website using Django
This project deals with developing a virtual 'E-commerce Website'. It provides the user with a list of the various products available for purchase in the store. For the convenience of online shopping, a shopping cart is provided to the user. After the selection of the goods, the order is sent for the confirmation process. The system is implemented using Python's web framework Django. To develop an e-commerce website, it is necessary to study and understand many technologies.
Scope: The scope of the project will be limited to some functions of the e-commerce website. It will display products; customers can browse catalogs, select products, and remove products from their cart, specifying the quantity of each item. Selected items are collected in a cart, and at checkout the items in the cart are presented as an order. Customers can pay for the items in the cart to complete an order. The project has great future scope. It also provides security through login IDs and passwords, so that no unauthorized user can access your account; only an authorized person with the appropriate access rights can use the software.
Technologies used in the project:
Django framework and SQLite database which comes by default with Django.
Required Skillset to Build the Project:
Knowledge of Python and basics of Django Framework.
ER and Use-Case Diagrams
Customer Interface:
• Customer shops for a product
• Customer changes quantity
• The customer adds an item to the cart
• Customer views cart
• Customer checks out
• Customer sends order
Use-Case diagram for Customer
Admin Interface:
• Admin logs in
• Admin inserts item
• Admin removes item
• Admin modifies item
Step by Step Implementation:
• Create Normal Project: Open the IDE and create a normal project by selecting File -> New Project.
• Install Django: Next, we will install the Django module from the terminal. We will use PyCharm's integrated terminal for this task. One can also use cmd on Windows and install the module by running the python -m pip install django command.
• Check the installed Django version: To check the installed Django version, run the python -m django --version command.
• Create Django Project: Executing the django-admin startproject ProjectName command creates a Django project inside the normal project we have already created.
• Check the Python 3 version: python3 --version
• Run the default Django web server: Django provides a built-in development server for launching our applications. Run the python manage.py runserver command in the terminal; by default, the server runs on port 8000, and the application can be accessed at the URL shown in the output.
Open the project folder using a text editor. The
directory structure should look like this : ->
Views :
In views, we create views named home.py, login.py, signup.py, cart.py, checkout.py, and orders.py, each of which takes a request and renders an HTML template as a response. Create home.html, login.html, signup.html, cart.html, checkout.html, and orders.html in the templates folder, and map the views in the store's urls.py file.
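As a hedged illustration of what one of these view modules and its URL mapping might look like (module, function, and template names follow the naming used above but are otherwise assumptions):
# Hedged sketch: store/views/home.py (names are assumptions based on the text above)
from django.shortcuts import render

def home(request):
    # Render the product listing page
    return render(request, "home.html")

# Hedged sketch: store/urls.py, mapping the view to a URL
# from django.urls import path
# from .views.home import home
#
# urlpatterns = [path("", home, name="home")]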
Projects for Interview
Data Scientist- Job Material
-Tableau
HR-Analytics
ABOUT THE PROJECT:
Every company has an HR department that deals with the various recruitment and placement activities of the company. In this project, insights are drawn from a large dataset; these insights can help the HR department understand the recruitment landscape of the market.
ABOUT THE DATASET:
The dataset was provided by Ineuron.ai in .sql format. It was scraped by them for the Hong Kong location.
INSIGHTS TO BE FOUND:
• What is the total number of jobs available?
• What is the total number of companies providing jobs?
• What is the total number of jobs in the various domains?
• What are the various career levels and how are they distributed across jobs?
• How are jobs distributed across the various analytics fields?
• Which company provides the highest number of jobs?
• Which domain has the highest number of jobs?
• What are the various job types for different job titles?
• Which are the top 5 companies with the highest number of jobs?
HOW TO RUN THE PROJECT:
1. First, we need MySQL Workbench installed on the system.
2. Connect to localhost and then import the files that are in .sql format to see the data; then check for the table.
Note: The results (insights) drawn can also be validated with simple queries from MySQL Workbench.
3. After that you'll have the data in your database; we can then connect with Tableau by providing the ID and password and proceed.
4. Final step: once Tableau is connected to MySQL, just open the Tableau file to see the dashboards.
Data Scientist- Job Material
How to Prepare for Data Science Job Interview
Interviews are a unique social situation that many may find extremely stressful to deal with. The key to doing well in
interviews is to keep calm and to do the proper preparation ahead of time.
Make a plan for the logistics: bring multiple copies of your up-to-date resume, and arrive 10 to 15 minutes early, mapping out
your transportation route to the interview location the night before. In the interview itself, remember to keep your answers
focused and to the point, directly answering the question being asked. While of course context is important, you also have a
limited amount of time in the conversation to create a good first impression. Be mindful to not go off into tangential
ramblings.
Do your research on the company itself, and do your best to understand their goals, culture, and product. Be able to describe
exactly what the company does in your own words. Understanding the company culture can be key to helping you decide
what attire would be appropriate for the interview, whether it be full formal or office casual. In addition, resources such as
Glassdoor can give you insight into the structure of the data science interview, and even exactly what questions will be asked.
Thank You
More Related Content

PDF
Guide for a Data Scientist
PPTX
Data science in business Administration Nagarajan.pptx
PPTX
Data Science Introduction: Concepts, lifecycle, applications.pptx
PDF
Data science tutorial
PPTX
Data science Nagarajan and madhav.pptx
PDF
Introduction to Data Science.pdf
PPTX
Unit 1-FDS. .pptx
PPTX
introductiontodatascience-230122140841-b90a0856 (1).pptx
Guide for a Data Scientist
Data science in business Administration Nagarajan.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptx
Data science tutorial
Data science Nagarajan and madhav.pptx
Introduction to Data Science.pdf
Unit 1-FDS. .pptx
introductiontodatascience-230122140841-b90a0856 (1).pptx

Similar to Data Science Job ready #DataScienceInterview Question and Answers 2022 | #DataScienceProjects (20)

PDF
How to become a data scientist
PDF
Untitled document.pdf
PPTX
Career in Python and data science
PPTX
Ch1IntroductiontoDataScience.pptx
PDF
A Beginner’s Guide to An Incredible Technology Data Science.pdf
Ā 
PDF
a-beginner-guide-to-an-incredible-technology-data-science.pdf
Ā 
PPTX
Data science
PPTX
The Power of Data Science by DICS INNOVATIVE.pptx
PPTX
Data science training institute in hyderabad
PPTX
Best data science training in Hyderabad
PPTX
Data science training in hyd ppt (1)
PPTX
data science training and placement
PPTX
Best Selenium certification course
PPTX
Data science online training in hyderabad
PPTX
online data science training
PPTX
data science online training in hyderabad
PPTX
Which institute is best for data science?
PPTX
Data science online training in hyderabad
PPTX
Data science training in hyd ppt (1)
PPTX
Data science training in Hyderabad
How to become a data scientist
Untitled document.pdf
Career in Python and data science
Ch1IntroductiontoDataScience.pptx
A Beginner’s Guide to An Incredible Technology Data Science.pdf
Ā 
a-beginner-guide-to-an-incredible-technology-data-science.pdf
Ā 
Data science
The Power of Data Science by DICS INNOVATIVE.pptx
Data science training institute in hyderabad
Best data science training in Hyderabad
Data science training in hyd ppt (1)
data science training and placement
Best Selenium certification course
Data science online training in hyderabad
online data science training
data science online training in hyderabad
Which institute is best for data science?
Data science online training in hyderabad
Data science training in hyd ppt (1)
Data science training in Hyderabad
Ad

More from Rohit Dubey (15)

PDF
Justice for CyberPaws: Guardians of Truth
PDF
From Shadows to Spotlight -Unmasking the Echoes of Laughter
PDF
Data Science decoded- author: Rohit Dubey
PPTX
DATA ANALYTICS INTRODUCTION
PDF
Congrats ! You got your Data Science Job
PPTX
Crack Data Analyst Interview Course
PDF
Business Analyst Job Interview
PPTX
Business Analyst Job Course.pptx
PPTX
Machine Learning with Python made easy and simple
PPTX
Crash Course on R Shiny Package
PDF
Rohit Dubey Data Scientist Resume
PDF
Data Scientist Rohit Dubey
PPT
Best way of Public Speaking by Rohit Dubey (Treejee)
PPTX
HbaseHivePigbyRohitDubey
PPTX
Big Data PPT by Rohit Dubey
Justice for CyberPaws: Guardians of Truth
From Shadows to Spotlight -Unmasking the Echoes of Laughter
Data Science decoded- author: Rohit Dubey
DATA ANALYTICS INTRODUCTION
Congrats ! You got your Data Science Job
Crack Data Analyst Interview Course
Business Analyst Job Interview
Business Analyst Job Course.pptx
Machine Learning with Python made easy and simple
Crash Course on R Shiny Package
Rohit Dubey Data Scientist Resume
Data Scientist Rohit Dubey
Best way of Public Speaking by Rohit Dubey (Treejee)
HbaseHivePigbyRohitDubey
Big Data PPT by Rohit Dubey
Ad

Recently uploaded (20)

PPTX
05. PRACTICAL GUIDE TO MICROSOFT EXCEL.pptx
PDF
Fluorescence-microscope_Botany_detailed content
PPTX
Business Acumen Training GuidePresentation.pptx
PPTX
1_Introduction to advance data techniques.pptx
PPT
Chapter 3 METAL JOINING.pptnnnnnnnnnnnnn
PPTX
IB Computer Science - Internal Assessment.pptx
PPTX
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
PPTX
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
PPTX
climate analysis of Dhaka ,Banglades.pptx
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
PPTX
oil_refinery_comprehensive_20250804084928 (1).pptx
PPTX
Global journeys: estimating international migration
PPTX
Introduction-to-Cloud-ComputingFinal.pptx
PPTX
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
PDF
Galatica Smart Energy Infrastructure Startup Pitch Deck
PDF
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
PDF
.pdf is not working space design for the following data for the following dat...
PPTX
Acceptance and paychological effects of mandatory extra coach I classes.pptx
PPTX
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
05. PRACTICAL GUIDE TO MICROSOFT EXCEL.pptx
Fluorescence-microscope_Botany_detailed content
Business Acumen Training GuidePresentation.pptx
1_Introduction to advance data techniques.pptx
Chapter 3 METAL JOINING.pptnnnnnnnnnnnnn
IB Computer Science - Internal Assessment.pptx
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
climate analysis of Dhaka ,Banglades.pptx
Business Ppt On Nestle.pptx huunnnhhgfvu
oil_refinery_comprehensive_20250804084928 (1).pptx
Global journeys: estimating international migration
Introduction-to-Cloud-ComputingFinal.pptx
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
Galatica Smart Energy Infrastructure Startup Pitch Deck
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
.pdf is not working space design for the following data for the following dat...
Acceptance and paychological effects of mandatory extra coach I classes.pptx
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx

Data Science Job ready #DataScienceInterview Question and Answers 2022 | #DataScienceProjects

  • 1. Basics Resume Preparation Basic Interview QnA Math SQL Excel Python Machine Learning Basics Numpy & Pandas Django Projects Data Scientist- Job Material
  • 2. Basics: Data Scientist- Job Material Data science is a deep study of the massive amount of data, which involves extracting meaningful insights from raw, structured, and unstructured data that is processed using the scientific method, different technologies, and algorithms. Data Science is about data gathering, analysis and decision-making. Data Science is about finding patterns in data, through analysis, and make future predictions. In short, we can say that data science is all about: • Asking the correct questions and analyzing the raw data. • Modeling the data using various complex and efficient algorithms. • Visualizing the data to get a better perspective. • Understanding the data to make better decisions and finding the final result.
  • 4. Basics: Data Scientist- Job Material Example: Let suppose we want to travel from station A to station B by car. Now, we need to take some decisions such as which route will be the best route to reach faster at the location, in which route there will be no traffic jam, and which will be cost-effective. All these decision factors will act as input data, and we will get an appropriate answer from these decisions, so this analysis of data is called the data analysis, which is a part of data science.
  • 5. Basics: Data Scientist- Job Material With the help of data science technology, we can convert the massive amount of raw and unstructured data into meaningful insights. Data science technology is opting by various companies, whether it is a big brand or a startup. Google, Amazon, Netflix, etc. which handle the huge amount of data, are using data science algorithms for better customer experience. Data science is working for automating transportation such as creating a self- driving car, which is the future of transportation. Data science can help in different predictions such as various survey, elections, flight ticket confirmation, etc.
  • 6. Basics: Data Scientist- Job Material Types of Data Science Job: If you learn data science, then you get the opportunity to find the various exciting job roles in this domain. The main job roles are given below: • Data Scientist • Data Analyst • Machine learning expert • Data engineer • Data Architect • Data Administrator • Business Analyst • Business Intelligence Manager Below is the explanation of some critical job titles of data science. 1. Data Analyst: Data analyst is an individual, who performs mining of huge amount of data, models the data, looks for patterns, relationship, trends, and so on. At the end of the day, he comes up with visualization and reporting for analyzing the data for decision making and problem-solving process. Skill required: For becoming a data analyst, you must get a good background in mathematics, business intelligence, data mining, and basic knowledge of statistics. You should also be familiar with some computer languages and tools such as MATLAB, Python, SQL, Hive, Pig, Excel, SAS, R, JS, Spark, etc.
  • 7. Basics: Data Scientist- Job Material 2. Machine Learning Expert: The machine learning expert is the one who works with various machine learning algorithms used in data science such as regression, clustering, classification, decision tree, random forest, etc. Skill Required: Computer programming languages such as Python, C++, R, Java, and Hadoop. You should also have an understanding of various algorithms, problem- solving analytical skill, probability, and statistics. 3. Data Scientist: A data scientist is a professional who works with an enormous amount of data to come up with compelling business insights through the deployment of various tools, techniques, methodologies, algorithms, etc. Skill required: To become a data scientist, one should have technical language skills such as R, SAS, SQL, Python, Hive, Pig, Apache spark, MATLAB. Data scientists must have an understanding of Statistics, Mathematics, visualization, and communication skills. Tools for Data Science Following are some tools required for data science: • Data Analysis tools: R, Python, Statistics, SAS, Jupyter, R Studio, MATLAB, Excel, RapidMiner. • Data Warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift • Data Visualization tools: R, Jupyter, Tableau, Cognos. • Machine learning tools: Spark, Mahout, Azure ML studio.
  • 8. Basics: Data Scientist- Job Material Prerequisite for Data Science Non-Technical Prerequisite: Curiosity: To learn data science, one must have curiosities. When you have curiosity and ask various questions, then you can understand the business problem easily. Critical Thinking: It is also required for a data scientist so that you can find multiple new ways to solve the problem with efficiency. Communication skills: Communication skills are most important for a data scientist because after solving a business problem, you need to communicate it with the team. Technical Prerequisite: Machine learning: To understand data science, one needs to understand the concept of machine learning. Data science uses machine learning algorithms to solve various problems. Mathematical modeling: Mathematical modeling is required to make fast mathematical calculations and predictions from the available data. Statistics: Basic understanding of statistics is required, such as mean, median, or standard deviation. It is needed to extract knowledge and obtain better results from the data. Computer programming: For data science, knowledge of at least one programming language is required. R, Python, Spark are some required computer programming languages for data science. Databases: The depth understanding of Databases such as SQL, is essential for data science to get the data and to work with data.
  • 9. Basics: Data Scientist- Job Material How to solve a problem in Data Science using Machine learning algorithms? Now, let's understand what are the most common types of problems occurred in data science and what is the approach to solving the problems. So in data science, problems are solved using algorithms, and below is the diagram representation for applicable algorithms for possible questions:
  • 10. Basics: Data Scientist- Job Material Is this A or B? : We can refer to this type of problem which has only two fixed solutions such as Yes or No, 1 or 0, may or may not. And this type of problems can be solved using classification algorithms. Is this different? : We can refer to this type of question which belongs to various patterns, and we need to find odd from them. Such type of problems can be solved using Anomaly Detection Algorithms. How much or how many? The other type of problem occurs which ask for numerical values or figures such as what is the time today, what will be the temperature today, can be solved using regression algorithms. How is this organized? Now if you have a problem which needs to deal with the organization of data, then it can be solved using clustering algorithms. Clustering algorithm organizes and groups the data based on features, colors, or other common characteristics.
  • 11. Basics: Data Scientist- Job Material Data Science life Cycle • Here is how a Data Scientist works: • Ask the right questions - To understand the business problem. • Explore and collect data - From database, web logs, customer feedback, etc. • Extract the data - Transform the data to a standardized format. • Clean the data - Remove erroneous values from the data. • Find and replace missing values - Check for missing values and replace them with a suitable value (e.g. an average value). • Normalize data - Scale the values in a practical range (e.g. 140 cm is smaller than 1,8 m. However, the number 140 is larger than 1,8. - so scaling is important). • Analyze data, find patterns and make future predictions. • Represent the result - Present the result with useful insights in a way the "company" can understand.
  • 12. Basics: Data Scientist- Job Material The main phases of data science life cycle are given below: 1. Discovery: The first phase is discovery, which involves asking the right questions. When you start any data science project, you need to determine what are the basic requirements, priorities, and project budget. In this phase, we need to determine all the requirements of the project such as the number of people, technology, time, data, an end goal, and then we can frame the business problem on first hypothesis level. 2. Data preparation: Data preparation is also known as Data Munging. In this phase, we need to perform the following tasks: • Data cleaning • Data Reduction • Data integration • Data transformation, After performing all the above tasks, we can easily use this data for our further processes. 3. Model Planning: In this phase, we need to determine the various methods and techniques to establish the relation between input variables. We will apply Exploratory data analytics(EDA) by using various statistical formula and visualization tools to understand the relations between variable and to see what data can inform us. Common tools used for model planning are: SQL Analysis Services • R • SAS • Python
  • 13. Basics: Data Scientist- Job Material 4. Model-building: In this phase, the process of model building starts. We will create datasets for training and testing purpose. We will apply different techniques such as association, classification, and clustering, to build the model. Following are some common Model building tools: o SAS Enterprise Miner o WEKA o SPCS Modeler o MATLAB 5. Operationalize: In this phase, we will deliver the final reports of the project, along with briefings, code, and technical documents. This phase provides you a clear overview of complete project performance and other components on a small scale before the full deployment. 6. Communicate results: In this phase, we will check if we reach the goal, which we have set on the initial phase. We will communicate the findings and final result with the business team.
  • 14. Basics: Data Scientist- Job Material Applications of Data Science: Image recognition and speech recognition: Data science is currently using for Image and speech recognition. When you upload an image on Facebook and start getting the suggestion to tag to your friends. This automatic tagging suggestion uses image recognition algorithm, which is part of data science. When you say something using, "Ok Google, Siri, Cortana", etc., and these devices respond as per voice control, so this is possible with speech recognition algorithm. Gaming world: In the gaming world, the use of Machine learning algorithms is increasing day by day. EA Sports, Sony, Nintendo, are widely using data science for enhancing user experience. Internet search: When we want to search for something on the internet, then we use different types of search engines such as Google, Yahoo, Bing, Ask, etc. All these search engines use the data science technology to make the search experience better, and you can get a search result with a fraction of seconds.
  • 15. Basics: Data Scientist- Job Material Transport: Transport industries also using data science technology to create self-driving cars. With self-driving cars, it will be easy to reduce the number of road accidents. Healthcare: In the healthcare sector, data science is providing lots of benefits. Data science is being used for tumor detection, drug discovery, medical image analysis, virtual medical bots, etc. Recommendation systems: Most of the companies, such as Amazon, Netflix, Google Play, etc., are using data science technology for making a better user experience with personalized recommendations. Such as, when you search for something on Amazon, and you started getting suggestions for similar products, so this is because of data science technology. Risk detection: Finance industries always had an issue of fraud and risk of losses, but with the help of data science, this can be rescued. Most of the finance companies are looking for the data scientist to avoid risk and any type of losses with an increase in customer satisfaction.
  • 16. Basics: Data Scientist- Job Material Transport: Transport industries also using data science technology to create self-driving cars. With self-driving cars, it will be easy to reduce the number of road accidents. Healthcare: In the healthcare sector, data science is providing lots of benefits. Data science is being used for tumor detection, drug discovery, medical image analysis, virtual medical bots, etc. Recommendation systems: Most of the companies, such as Amazon, Netflix, Google Play, etc., are using data science technology for making a better user experience with personalized recommendations. Such as, when you search for something on Amazon, and you started getting suggestions for similar products, so this is because of data science technology. Risk detection: Finance industries always had an issue of fraud and risk of losses, but with the help of data science, this can be rescued. Most of the finance companies are looking for the data scientist to avoid risk and any type of losses with an increase in customer satisfaction.
  • 17. Resume Preparation: Data Scientist- Job Material • Make a great first impression. By customizing your resume for the role you are applying to, you can tell the hiring manager why you’re perfect for it. This will make a good first impression, setting you up for success in your interview. • Stand apart from the crowd. Recruiters skim through hundreds of resumes on any given day. With a clear and impactful data scientist resume, you can differentiate yourself from the crowd. By highlighting your unique combination of skills and experience, you can make an impression beyond the standard checklist of technical skills alone. • Drive the interview conversation. Hiring managers and interviewers often bring your resume to the interview, asking questions based on it. By including the right information in your resume, you can drive the conversation. • Negotiate competitive pay. While a resume might not have a direct impact on the pay, it plays the role of a single source of truth for your qualifications. By including all relevant skills and experience, you can make sure that the offer is reflective of your value to the employer.
  • 18. Resume Preparation: Data Scientist- Job Material What Should You Include in Your Data Science Resume? Name and Contact Information Once the recruiter has seen your resume and you’re shortlisted, they would want to contact you. To make this seamless, include your contact information clearly and prominently. But remember that this is simply functional information. So, keep it concise. Double-check that it’s accurate. Include: Name Email ID Phone number LinkedIn, portfolio, or GitHub profiles, if any
  • 19. Resume Preparation: Data Scientist- Job Material What Should You Include in Your Data Science Resume? Career Objective/Summary This is often the first section in any resume. As a fresh graduate, without much professional experience, the career objective section acts as an indicator of what you would like to accomplish at the job you’re applying to. On the other hand, if you have some experience, it is better to include a personal profile, summarizing your skills and experiences. A few things to keep in mind while writing your career objective/summary: • Use this section to narrate your professional story, so paragraphs with complete sentences work better than a bulleted list • Mention the years of experience you have • Provide information on the industry, function, and roles you have worked in
  • 20. Resume Preparation: Data Scientist- Job Material What Should You Include in Your Data Science Resume? While creating your resume, it is sometimes better to write this section last. Making the rest of your data scientist resume will help hone in on the right summary. Also, remember to customize your summary while applying for the job. Not all jobs are the same, so your summary should reflect what you can do for the particular role you’re applying to Work Experience As a practical field, work experience is more important in data science jobs than theoretical knowledge. Therefore, this is the most crucial part of your resume. If you are a fresh graduate, make sure to include any internships, personal projects, open-source contributions you might have.
  • 21. Resume Preparation: Data Scientist- Job Material What Should You Include in Your Data Science Resume? If you’re an experienced data scientist, spend enough time to tell your professional story clearly: o List your work experience in reverse chronological order, with the most recent work listed on top and the others following o Indicate your designation, name of the company, and work period o Write 1-2 lines about what you were responsible for o Include the tasks you performed on a regular basis o Demonstrate outcomes—if you have produced quantifiable results, be sure to include them. For instance: ā€œI built a production prediction engine in Python page that helped reduce crude oil profit loss by 22%ā€ o Add accomplishments like awards and recognitions, if any. Layout-wise, follow consistency within this section. For instance, if you use bullets to list your tasks, use them uniformly across all your job titles.
  • 22. Resume Preparation: Data Scientist- Job Material Projects Showing your hiring manager a peek into the work you’ve done is a great way to demonstrate your capabilities. The projects section can be used for that. While deciding which of your projects to include in your resume, consider the following: Relevance. You might have worked on several projects, but the most valuable are the ones that are relevant to the role that you’re applying to. So, pick the most relevant 2-3 projects you’ve worked on. Write a summary. Write 1-2 lines about the business context and your work. It helps to show that you know how to use technical skills to achieve business outcomes. Show technical expertise. Also include a short list of the tools, technologies, and processes you used to complete the project. It is also an option to write a detailed case study of your projects on a blog or Medium and link it here.
  • 23. Resume Preparation: Data Scientist- Job Material Skills The first person to see your resume is often a recruiter who might not have the technical skills to evaluate. So, they typically try to match every resume to the job description to identify if the candidate has the skills necessary. Some organizations also use an applicant tracking system (ATS) to automate the screening. Therefore, it is important that your resume list the skills the job description demands. Keep it short Include all the skills you have that the job description demands Even if you have mentioned it in the experience or summary section, repeat it here Education So, keep this section concise and clear. List post-secondary degrees in your education section (i.e., community college, college, and graduate degrees) Include the year of graduation If you’re a fresh graduate, you can mention subjects you’ve studied that are relevant to the job you’re applying to If you have a certification or have completed an online course in data science or related subjects, make sure to include them as well
  • 24. Resume Preparation: Data Scientist- Job Material Senior Data Scientist Resume Examples What To Include As a senior data scientist with experience, you would be aiming for a position with more responsibility, like a data science manager, for example. This demands a customized and confident resume. o Customize the resume for the job you’re applying to—highlight relevant skills/experience, mirror the job description o Focus on responsibilities and accomplishments instead of tasks o Include business outcomes you’ve produced with your work o Present case studies of your key projects Why is this resume good? The information is organized in a clear and concise manner giving the entire view of the candidate’s career without overwhelming the reader Each job has quantifiable outcomes, demonstrating the business acumen of the candidate Also subtly hints at leadership skills by mentioning the responsibilities taken in coaching and leading teams
  • 25. Resume Preparation: Data Scientist- Job Material Senior Data Scientist Resume Examples
  • 26. Basic Interview QnA: Data Scientist- Job Material 1. What is Data Science? Data Science is a combination of algorithms, tools, and machine learning technique which helps you to find common hidden patterns from the given raw data. 2. What is logistic regression in Data Science? Logistic Regression is also called as the logit model. It is a method to forecast the binary outcome from a linear combination of predictor variables. 3. Name three types of biases that can occur during sampling In the sampling process, there are three types of biases, which are: Selection bias Under coverage bias Survivorship bias 4. Discuss Decision Tree algorithm A decision tree is a popular supervised machine learning algorithm. It is mainly used for Regression and Classification. It allows breaks down a dataset into smaller subsets. The decision tree can able to handle both categorical and numerical data.
  • 27. Basic Interview QnA: Data Scientist- Job Material 5. What is Prior probability and likelihood? Prior probability is the proportion of the dependent variable in the data set while the likelihood is the probability of classifying a given observant in the presence of some other variable. 6. Explain Recommender Systems? It is a subclass of information filtering techniques. It helps you to predict the preferences or ratings which users likely to give to a product. 7. Name three disadvantages of using a linear model Three disadvantages of the linear model are: The assumption of linearity of the errors. You can’t use this model for binary or count outcomes There are plenty of overfitting problems that it can’t solve
  • 28. Basic Interview QnA: Data Scientist- Job Material 8. Why do you need to perform resampling? Resampling is done in below-given cases: Estimating the accuracy of sample statistics by drawing randomly with replacement from a set of the data point or using as subsets of accessible data Substituting labels on data points when performing necessary tests Validating models by using random subsets 9. List out the libraries in Python used for Data Analysis and Scientific Computations. • SciPy • Pandas • Matplotlib • NumPy • SciKit • Seaborn 10. What is Power Analysis? The power analysis is an integral part of the experimental design. It helps you to determine the sample size requires to find out the effect of a given size from a cause with a specific level of assurance. It also allows you to deploy a particular probability in a sample size constraint.
• 29. Basic Interview QnA: Data Scientist- Job Material
11. Explain collaborative filtering
Collaborative filtering is used to find patterns by combining viewpoints, multiple data sources, and various agents.
12. What is bias?
Bias is an error introduced in your model because of the oversimplification of a machine learning algorithm. It can lead to under-fitting.
13. Discuss 'Naive' in the Naive Bayes algorithm
The Naive Bayes model is based on Bayes' theorem. It describes the probability of an event based on prior knowledge of conditions that might be related to that event; it is called "naive" because it assumes the features are independent of each other.
14. What is linear regression?
Linear regression is a statistical method where the score of a variable 'A' is predicted from the score of a second variable 'B'. B is referred to as the predictor variable and A as the criterion variable.
15. State the difference between the expected value and the mean value
There are not many differences, but the two terms are used in different contexts. Mean value is generally referred to when you are discussing a probability distribution, whereas expected value is referred to in the context of a random variable.
• 31. Basic Interview QnA: Data Scientist- Job Material
16. What is the aim of conducting A/B testing?
A/B testing is used to conduct random experiments with two variants, A and B. The goal of this testing method is to find out which changes to a web page maximize or increase the outcome of a strategy.
17. What is ensemble learning?
An ensemble is a method of combining a diverse set of learners to improve the stability and predictive power of the model. Two types of ensemble learning methods are listed here (a short code sketch follows after this slide):
Bagging
The bagging method fits similar learners on small sample populations and averages their predictions, which reduces variance.
Boosting
Boosting is an iterative method that adjusts the weight of an observation depending on the last classification. Boosting decreases the bias error and helps you build strong predictive models.
18. Explain eigenvalues and eigenvectors
Eigenvectors and eigenvalues are used to understand linear transformations; data scientists typically calculate them for a covariance or correlation matrix. Eigenvectors are the directions along which a particular linear transformation acts by compressing, flipping, or stretching, and the corresponding eigenvalues give the factor by which the transformation stretches along each of those directions.
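A brief illustration of the two ensemble styles from Q17 above, assuming scikit-learn and a synthetic dataset; the specific estimators and parameters are illustrative:
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Bagging: many trees fit on bootstrap samples, predictions averaged (reduces variance)
bagging = BaggingClassifier(n_estimators=50, random_state=0)
# Boosting: trees fit sequentially, each focusing on the previous errors (reduces bias)
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0)

print("bagging :", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting:", cross_val_score(boosting, X, y, cv=5).mean())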
• 32. Basic Interview QnA: Data Scientist- Job Material
19. Define the term cross-validation
Cross-validation is a validation technique for evaluating how well the outcomes of a statistical analysis will generalize to an independent dataset. It is used in settings where the objective is forecasting and one needs to estimate how accurately a model will perform in practice.
20. Explain the steps for a data analytics project
The following are the important steps involved in an analytics project:
Understand the business problem
Explore the data and study it carefully
Prepare the data for modeling by finding missing values and transforming variables
Run the model and analyze the results
Validate the model with a new data set
Implement the model and track the results to analyze the performance of the model over a specific period
21. Discuss Artificial Neural Networks
Artificial neural networks (ANNs) are a special set of algorithms that have revolutionized machine learning. They adapt to changing input, so the network generates the best possible result without redesigning the output criteria.
• 33. Basic Interview QnA: Data Scientist- Job Material
22. What is backpropagation?
Backpropagation is the essence of neural net training. It is the method of tuning the weights of a neural net based on the error rate obtained in the previous epoch. Proper tuning of the weights helps you reduce error rates and makes the model more reliable by increasing its generalization.
23. What is a Random Forest?
Random forest is a machine learning method that can perform both regression and classification tasks. It is also used for treating missing values and outlier values.
24. What is the importance of having a selection bias?
Selection bias occurs when no specific randomization is achieved while picking individuals, groups, or data to be analyzed. It means that the given sample does not exactly represent the population that was intended to be analyzed.
25. What is the K-means clustering method?
K-means clustering is an important unsupervised learning method. It is the technique of classifying data into a certain number of clusters, called K clusters. It is deployed for grouping data to find similarity in the data.
26. Explain the difference between Data Science and Data Analytics
Data scientists need to slice data to extract valuable insights that a data analyst can apply to real-world business scenarios. The main difference between the two is that data scientists usually have more technical knowledge, while data analysts work more closely with the business and focus on visualization and reporting.
• 34. Basic Interview QnA: Data Scientist- Job Material
27. Explain the p-value
When you conduct a hypothesis test in statistics, a p-value allows you to determine the strength of your results. It is a number between 0 and 1; based on its value, it helps you judge the strength of the specific result.
28. Define the term deep learning
Deep learning is a subtype of machine learning concerned with algorithms inspired by the structure of the brain, called artificial neural networks (ANNs).
29. Explain a method to collect and analyze social media data to predict the weather.
You can collect social media data using the Facebook, Twitter, or Instagram APIs. For Twitter, for example, you can construct features from each tweet, such as the tweet date, retweets, list of followers, etc. Then you can use a multivariate time series model to predict the weather condition.
30. When do you need to update an algorithm in Data Science?
You need to update an algorithm in the following situations:
You want your data model to evolve as data streams through the infrastructure
The underlying data source is changing
The data is non-stationary
• 35. Basic Interview QnA: Data Scientist- Job Material
31. What is a normal distribution?
A normal distribution is a continuous probability distribution in which the values are spread symmetrically around the mean in the shape of a bell curve. It is useful for analyzing variables and their relationships.
32. Which language is best for text analytics? R or Python?
Python is generally more suitable for text analytics because it has rich libraries such as pandas, which provide high-level data analysis tools and data structures, along with mature text-processing libraries; R's tooling for this is usually considered less convenient.
33. Explain the benefits of using statistics for Data Scientists
Statistics help data scientists get a better idea of customers' expectations. Using statistical methods, data scientists can gain knowledge about consumer interest, behavior, engagement, retention, etc. Statistics also help to build powerful data models and to validate inferences and predictions.
34. Name various types of Deep Learning frameworks
o PyTorch
o Microsoft Cognitive Toolkit
o TensorFlow
o Caffe
o Chainer
o Keras
• 36. Basic Interview QnA: Data Scientist- Job Material
35. Explain Auto-Encoders
Autoencoders are learning networks that transform inputs into outputs with as few errors as possible, meaning the output is as close to the input as possible.
36. Define a Boltzmann Machine
A Boltzmann machine is a simple learning algorithm that helps you discover features that represent complex regularities in the training data. It allows you to optimize the weights and the quantities for the given problem.
37. Explain why data cleansing is essential and which method you use to maintain clean data
Dirty data often leads to incorrect insights, which can damage the prospects of any organization. For example, if you want to run a targeted marketing campaign but your data incorrectly tells you that a specific product will be in demand with your target audience, the campaign will fail.
38. What are skewed and uniform distributions?
A skewed distribution occurs when the data is concentrated on one side of the plot, whereas a uniform distribution is one where the data is spread equally across the range.
39. When does under-fitting occur in a statistical model?
Under-fitting occurs when a statistical model or machine learning algorithm is not able to capture the underlying trend of the data.
• 37. Basic Interview QnA: Data Scientist- Job Material
40. What is reinforcement learning?
Reinforcement learning is a learning mechanism for mapping situations to actions so as to maximize a reward signal. The learner is not told which action to take but instead must discover which actions yield the maximum reward; the method is based on a reward/penalty mechanism.
41. Name commonly used algorithms.
The four algorithms most commonly used by data scientists are:
Linear regression
Logistic regression
Random forest
KNN
42. What is precision?
Precision is one of the most commonly used error metrics in classification. It ranges from 0 to 1, where 1 represents 100%.
43. What is a univariate analysis?
An analysis that is applied to one attribute at a time is known as univariate analysis. The boxplot is a widely used univariate plot.
• 38. Basic Interview QnA: Data Scientist- Job Material
44. How do you overcome challenges to your findings?
To overcome challenges to your findings, you need to encourage discussion, demonstrate leadership, and respect different opinions.
45. Explain the cluster sampling technique in Data Science
A cluster sampling method is used when it is challenging to study a target population that is spread out and simple random sampling can't be applied.
46. State the difference between a Validation Set and a Test Set
A validation set is mostly considered part of the training set, as it is used for parameter selection, which helps you avoid overfitting the model being built. A test set is used for testing or evaluating the performance of a trained machine learning model.
47. Explain the Binomial Probability Formula
The binomial distribution contains the probabilities of every possible number of successes in N independent trials, each with probability π of success: P(X = k) = C(N, k) × π^k × (1 − π)^(N − k).
• 39. Basic Interview QnA: Data Scientist- Job Material
48. What is recall?
Recall is the ratio of true positives to all actual positives (true positives plus false negatives). It ranges from 0 to 1.
49. Discuss the normal distribution
The normal distribution is symmetric, so the mean, median, and mode are all equal.
50. While working on a data set, how can you select important variables? Explain
You can use the following variable selection methods (a short sketch follows after this slide):
Remove correlated variables before selecting important variables
Use linear regression and select variables based on their p-values
Use backward selection, forward selection, and stepwise selection
Use XGBoost or Random Forest and plot the variable importance chart
Measure the information gain for the given set of features and select the top n features accordingly
51. Is it possible to capture the correlation between a continuous and a categorical variable?
Yes, we can use the analysis of covariance (ANCOVA) technique to capture the association between continuous and categorical variables.
52. Would treating a categorical variable as a continuous variable result in a better predictive model?
A categorical variable should be treated as continuous only when it is ordinal in nature; in that case it can lead to a better predictive model.
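As a sketch of the variable-importance approach from Q50, assuming scikit-learn and a synthetic dataset; the feature indices and parameters are illustrative:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 8 features, only a few of which are informative
X, y = make_classification(n_samples=400, n_features=8, n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank features by importance; the top-ranked ones would be kept
ranked = sorted(enumerate(forest.feature_importances_), key=lambda t: t[1], reverse=True)
for idx, importance in ranked:
    print(f"feature_{idx}: {importance:.3f}")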
• 40. Math Interview QnA: Data Scientist- Job Material
Q: When should you use a t-test vs a z-test?
A z-test is a hypothesis test with a normal distribution that uses a z-statistic. A z-test is used when you know the population variance, or when you don't know the population variance but have a large sample size.
A t-test is a hypothesis test with a t-distribution that uses a t-statistic. You would use a t-test when you don't know the population variance and have a small sample size.
As a rule of thumb: known population variance (or a large sample) calls for a z-test; unknown population variance and a small sample calls for a t-test.
• 41. Math Interview QnA: Data Scientist- Job Material
Q: How would you describe what a 'p-value' is to a non-technical person?
The best way to describe the p-value in simple terms is with an example. In practice, if the p-value is less than the alpha, say 0.05, we are saying that there is a probability of less than 5% that the result could have happened by chance. Similarly, a p-value of 0.05 is the same as saying "5% of the time, we would see this by chance."
• 42. Math Interview QnA: Data Scientist- Job Material
Q: What are cherry-picking, p-hacking, and significance chasing?
Cherry-picking refers to the practice of only selecting data or information that supports one's desired conclusion.
P-hacking refers to manipulating the data collection or analysis until non-significant results become significant. This includes deciding mid-test not to collect any more data.
Significance chasing refers to a researcher reporting insignificant results as if they are "almost" significant.
Q: What is the assumption of normality?
The assumption of normality is that the sampling distribution of the mean is normal and centers around the population parameter, in line with the central limit theorem.
Q: What is the central limit theorem and why is it so important?
The central limit theorem is very powerful: it states that the distribution of sample means approximates a normal distribution. To give an example, you would take a sample from a data set and calculate the mean of that sample. Once repeated multiple times, you would plot all your means and their frequencies onto a graph and see that a bell curve, also known as a normal distribution, has been created. The mean of this distribution will closely resemble that of the original data. The central limit theorem is important because it is used in hypothesis testing and to calculate confidence intervals.
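A small simulation of the central limit theorem described above; a minimal sketch assuming NumPy, with an arbitrary (clearly non-normal) exponential population:
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)   # skewed, non-normal population

# Repeatedly sample 50 values and record the sample mean
sample_means = [rng.choice(population, size=50).mean() for _ in range(2_000)]

print("population mean:", population.mean())
print("mean of sample means:", np.mean(sample_means))   # close to the population mean
# A histogram of sample_means would look approximately bell-shaped.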
• 43. Math Interview QnA: Data Scientist- Job Material
Q: What is the empirical rule?
The empirical rule states that if a dataset is normally distributed, 68% of the data will fall within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations.
Q: What general conditions must be satisfied for the central limit theorem to hold?
The data must be sampled randomly
The sample values must be independent of each other
The sample size must be sufficiently large, generally greater than or equal to 30
Q: What is the difference between a combination and a permutation?
A permutation of n elements is any arrangement of those n elements in a definite order. There are n factorial (n!) ways to arrange n elements. Note: order matters! The number of permutations of n things taken r at a time is the number of ordered r-tuples that can be taken from n different elements, and is equal to nPr = n! / (n − r)!.
• 44. Math Interview QnA: Data Scientist- Job Material
On the other hand, combinations refer to the number of ways to choose r out of n objects where order doesn't matter. The number of combinations of n things taken r at a time is the number of subsets with r elements of a set with n elements, and is equal to nCr = n! / (r!(n − r)!).
Q: How many permutations does a license plate have with 6 digits?
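A quick check of these counts in Python; the license-plate answer depends on whether digits may repeat, which the slide does not state, so both cases are shown as assumptions:
import math

print(math.perm(5, 2))   # ordered arrangements of 2 items from 5 -> 20
print(math.comb(5, 2))   # unordered selections of 2 items from 5 -> 10

# License plate with 6 digits (0-9):
print(10 ** 6)           # 1,000,000 if digits may repeat
print(math.perm(10, 6))  # 151,200 if all six digits must be distinct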
• 45. Math Interview QnA: Data Scientist- Job Material
Q: How are confidence intervals and hypothesis tests similar? How are they different?
Confidence intervals and hypothesis testing are both tools used to make statistical inferences.
A confidence interval suggests a range of values for an unknown parameter, together with a confidence level that the true parameter lies within the suggested range. Confidence intervals are often very important in medical research to give researchers a stronger basis for their estimates. A confidence interval can be written as "10 +/- 0.5" or [9.5, 10.5], for example.
Hypothesis testing is the basis of any research question and often comes down to trying to show that something did not happen by chance. For example, you could try to show that, when rolling a die, one number is more likely to come up than the rest.
• 46. Math Interview QnA: Data Scientist- Job Material
Q: What is the difference between observational and experimental data?
Observational data comes from observational studies, in which you observe certain variables and try to determine if there is any correlation.
Experimental data comes from experimental studies, in which you control certain variables and hold them constant to determine if there is any causality.
An example of experimental design is the following: split a group into two. The control group lives their lives normally. The test group is told to drink a glass of wine every night for 30 days. Then research can be conducted to see how wine affects sleep.
Q: Give some examples of random sampling techniques
Simple random sampling requires using randomly generated numbers to choose a sample. More specifically, it initially requires a sampling frame, a list or database of all members of the population. You can then randomly generate a number for each element, using Excel for example, and take the first n samples that you require.
Systematic sampling can be even easier: you simply take one element from your sampling frame, skip a predefined amount (n), and then take your next element. Going back to our example, you could take every fourth name on the list.
• 47. Math Interview QnA: Data Scientist- Job Material
Cluster sampling starts by dividing a population into groups, or clusters. What makes this different from stratified sampling is that each cluster must be representative of the population. You then randomly select entire clusters to sample. For example, if an elementary school had five different grade eight classes, cluster random sampling might be used and only one whole class would be chosen as the sample.
Stratified random sampling starts by dividing a population into groups with similar attributes, and then a random sample is taken from each group. This method is used to ensure that different segments of a population are equally represented. To give an example, imagine a survey is conducted at a school to determine overall satisfaction. It might make sense here to use stratified random sampling to equally represent the opinions of students in each department.
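A compact sketch of two of these sampling techniques in Python, assuming NumPy and a made-up sampling frame of 1,000 member IDs (cluster and stratified sampling follow the same idea at the group level):
import numpy as np

rng = np.random.default_rng(42)
population = np.arange(1000)             # hypothetical sampling frame of 1,000 members

# Simple random sampling: 50 members chosen uniformly without replacement
simple_sample = rng.choice(population, size=50, replace=False)

# Systematic sampling: every 20th member starting from a random offset
start = rng.integers(0, 20)
systematic_sample = population[start::20]

print(simple_sample[:10])
print(systematic_sample[:10])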
• 48. Math Interview QnA: Data Scientist- Job Material
Q: What is the difference between a type 1 error and a type 2 error?
A type 1 error is when you incorrectly reject a true null hypothesis. It's also called a false positive.
A type 2 error is when you fail to reject a false null hypothesis. It's also called a false negative.
Q: What is the power of a test? What are two ways to increase the power of a test?
The power of a test is the probability of rejecting the null hypothesis when it's false. It is equal to 1 minus beta (the type 2 error rate). To increase the power of a test, you can do two things:
Increase alpha, although this also increases the chance of a type 1 error
Increase the sample size, n; this maintains the type 1 error rate while reducing the type 2 error rate
Q: What is the Law of Large Numbers?
The Law of Large Numbers states that as the number of trials increases, the average of the results becomes closer to the expected value. For example, the proportion of heads when flipping a fair coin 100,000 times should be closer to 0.5 than when flipping it 100 times.
Q: What is the Pareto principle?
The Pareto principle, also known as the 80/20 rule, states that 80% of the effects come from 20% of the causes. For example, 80% of sales come from 20% of customers.
• 49. Math Interview QnA: Data Scientist- Job Material
Q: What is a confounding variable?
A confounding variable, or confounder, is a variable that influences both the dependent variable and the independent variable, causing a spurious association: a mathematical relationship in which two or more variables are associated but not causally related.
Q: What are the assumptions required for linear regression?
There are four major assumptions:
There is a linear relationship between the dependent variable and the regressors, meaning the model you are creating actually fits the data
The errors or residuals of the data are normally distributed and independent from each other
There is minimal multicollinearity between explanatory variables
Homoscedasticity: the variance around the regression line is the same for all values of the predictor variable
Q: What do interpolation and extrapolation mean? Which is generally more accurate?
Interpolation is a prediction made using inputs that lie within the set of observed values. Extrapolation is a prediction made using an input that lies outside the set of observed values. Generally, interpolations are more accurate.
• 50. Math Interview QnA: Data Scientist- Job Material
Q: What does autocorrelation mean?
Autocorrelation is when future outcomes depend on previous outcomes. When there is autocorrelation, the errors show a sequential pattern and the model is less accurate.
Q: When you sample, what potential biases can you be inflicting?
Potential biases include the following:
Sampling bias: a biased sample caused by non-random sampling
Undercoverage bias: sampling too few observations
Survivorship bias: the error of overlooking observations that did not make it past a selection process
Q: What is an outlier? Explain how you might screen for outliers and what you would do if you found them in your dataset.
An outlier is a data point that differs significantly from other observations. Depending on the cause, outliers can be bad from a machine learning perspective because they can worsen the accuracy of a model. Common ways to screen for outliers are z-scores, the interquartile range, and box plots. If an outlier is caused by a measurement error, it is important to remove it from the dataset.
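A minimal z-score screen for outliers, assuming NumPy and a made-up sample; the |z| > 3 threshold is a common convention, not from the slides:
import numpy as np

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(loc=50, scale=5, size=200), [120.0]])  # 120.0 is an injected outlier

z_scores = (data - data.mean()) / data.std()
outliers = data[np.abs(z_scores) > 3]          # |z| > 3 is a common rule of thumb

print("flagged as outliers:", outliers)        # should contain only the 120.0 value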
• 51. Math Interview QnA: Data Scientist- Job Material
Q: What is an inlier?
An inlier is a data observation that lies within the rest of the dataset but is nonetheless unusual or erroneous. Since it lies inside the bulk of the data, it is typically harder to identify than an outlier and often requires external data to spot. Should you identify any inliers, you can simply remove them from the dataset.
Q: You flip a biased coin (p(head) = 0.8) five times. What's the probability of getting three or more heads?
Use the general binomial probability formula to answer this question:
p = 0.8, n = 5, k = 3, 4, 5
P(3 or more heads) = P(3 heads) + P(4 heads) + P(5 heads) ≈ 0.94, or 94%
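The same calculation checked in Python, a sketch assuming SciPy:
from scipy.stats import binom

# P(X >= 3) for X ~ Binomial(n=5, p=0.8) is 1 - P(X <= 2)
p_three_or_more = 1 - binom.cdf(2, n=5, p=0.8)
print(round(p_three_or_more, 4))   # 0.9421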
• 52. Math Interview QnA: Data Scientist- Job Material
Q: A random variable X is normal with mean 1020 and a standard deviation of 50. Calculate P(X > 1200)
Using Excel:
p = 1 - NORM.DIST(1200, 1020, 50, TRUE)
p = 0.000159
Q: Consider that the number of people who show up at a bus station is Poisson with mean 2.5/hour. What is the probability that at most three people show up in a four-hour period?
x = 3
mean = 2.5 * 4 = 10
Using Excel:
p = POISSON.DIST(3, 10, TRUE)
p = 0.010336
Q: Suppose that diastolic blood pressures (DBPs) for men aged 35–44 are normally distributed with a mean of 80 (mm Hg) and a standard deviation of 10. About what is the probability that a random 35–44 year old has a DBP less than 70?
Since 70 is one standard deviation below the mean, take the area of the Gaussian distribution to the left of one standard deviation below the mean:
2.3% + 13.6% = 15.9%
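The same three probabilities can be checked in Python; a sketch assuming SciPy:
from scipy.stats import norm, poisson

print(norm.sf(1200, loc=1020, scale=50))   # P(X > 1200)  ≈ 0.000159
print(poisson.cdf(3, mu=10))               # P(at most 3 arrivals in 4 h) ≈ 0.0103
print(norm.cdf(70, loc=80, scale=10))      # P(DBP < 70)  ≈ 0.159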
• 53. Math Interview QnA: Data Scientist- Job Material
Q: Give an example where the median is a better measure than the mean
When there are a number of outliers that positively or negatively skew the data.
Q: Given two fair dice, what is the probability of getting scores that sum to 4? To 8?
There are 3 combinations that sum to 4 (1+3, 3+1, 2+2): P(rolling a 4) = 3/36 = 1/12
There are 5 combinations that sum to 8 (2+6, 6+2, 3+5, 5+3, 4+4): P(rolling an 8) = 5/36
Q: If a distribution is skewed to the right and has a median of 30, will the mean be greater than or less than 30?
If the distribution is right-skewed, the mean will generally be greater than 30, while the mode will be less than 30.
• 54. SQL Interview QnA: Data Scientist- Job Material
MySQL Create Table Example
Below is a MySQL example to create a table in a database:
CREATE TABLE IF NOT EXISTS `MyFlixDB`.`Members` (
`membership_number` INT NOT NULL AUTO_INCREMENT,
`full_names` VARCHAR(150) NOT NULL,
`gender` VARCHAR(6),
`date_of_birth` DATE,
`physical_address` VARCHAR(255),
`postal_address` VARCHAR(255),
`contact_number` VARCHAR(75),
`email` VARCHAR(255),
PRIMARY KEY (`membership_number`)
) ENGINE = InnoDB;
• 55. SQL Interview QnA: Data Scientist- Job Material
Let's see a query for creating a table which has columns of all data types. Study it and identify how each data type is defined in the CREATE TABLE MySQL example below.
CREATE TABLE `all_data_types` (
`varchar` VARCHAR(20),
`tinyint` TINYINT,
`text` TEXT,
`date` DATE,
`smallint` SMALLINT,
`mediumint` MEDIUMINT,
`int` INT,
`bigint` BIGINT,
`float` FLOAT(10, 2),
`double` DOUBLE,
`decimal` DECIMAL(10, 2),
`datetime` DATETIME,
`timestamp` TIMESTAMP,
`time` TIME,
`year` YEAR,
`char` CHAR(10),
`tinyblob` TINYBLOB,
`tinytext` TINYTEXT,
`blob` BLOB,
`mediumblob` MEDIUMBLOB,
`mediumtext` MEDIUMTEXT,
`longblob` LONGBLOB,
`longtext` LONGTEXT,
`enum` ENUM('1', '2', '3'),
`set` SET('1', '2', '3'),
`bool` BOOL,
`binary` BINARY(20),
`varbinary` VARBINARY(20)
) ENGINE = MYISAM;
• 56. SQL Interview QnA: Data Scientist- Job Material
1. What is DBMS?
A Database Management System (DBMS) is a program that controls the creation, maintenance, and use of a database. A DBMS can be thought of as a file manager that manages data in a database rather than saving it in file systems.
2. What is RDBMS?
RDBMS stands for Relational Database Management System. An RDBMS stores data in a collection of tables, which are related by common fields between the columns of the tables. It also provides relational operators to manipulate the data stored in the tables. Example: SQL Server.
3. What is SQL?
SQL stands for Structured Query Language, and it is used to communicate with a database. It is a standard language used to perform tasks such as retrieval, updating, insertion, and deletion of data in a database. Standard SQL commands include SELECT, INSERT, UPDATE, DELETE, CREATE, and DROP.
• 57. SQL Interview QnA: Data Scientist- Job Material
4. What is a Database?
A database is an organized form of data that allows easy access, storage, retrieval, and management of data. It is also known as a structured form of data which can be accessed in many ways. Example: School Management Database, Bank Management Database.
5. What are tables and fields?
A table is a set of data organized in a model with columns and rows. Columns are vertical and rows are horizontal. A table has a specified number of columns, called fields, but can have any number of rows, each of which is called a record.
Example:
Table: Employee
Fields: Emp ID, Emp Name, Date of Birth
Data: 201456, David, 11/15/1960
• 58. SQL Interview QnA: Data Scientist- Job Material
6. What is a primary key?
A primary key is a field, or combination of fields, that uniquely identifies a row. It is a special kind of unique key and has an implicit NOT NULL constraint, which means primary key values cannot be NULL.
7. What is a unique key?
A unique key constraint uniquely identifies each record in the database and provides uniqueness for a column or set of columns. A primary key constraint has an automatic unique constraint defined on it, but this is not the case for a unique key. There can be many unique constraints defined per table, but only one primary key constraint per table.
8. What is a foreign key?
A foreign key is a field in one table that refers to the primary key of another table. The relationship between two tables is created by referencing the foreign key against the primary key of the other table.
9. What is a join?
JOIN is a keyword used to query data from multiple tables based on the relationship between the fields of the tables. Keys play a major role when JOINs are used.
• 59. SQL Interview QnA: Data Scientist- Job Material
10. What are the types of join? Explain each.
There are various types of join which can be used to retrieve data, and the choice depends on the relationship between the tables. (A small runnable example follows after this slide.)
Inner Join. An inner join returns rows when there is at least one match between the tables.
Right Join. A right join returns the rows that are common between the tables plus all rows of the right-hand table. Simply put, it returns all the rows from the right-hand table even when there are no matches in the left-hand table.
Left Join. A left join returns the rows that are common between the tables plus all rows of the left-hand table. Simply put, it returns all the rows from the left-hand table even when there are no matches in the right-hand table.
Full Join. A full join returns rows when there is a matching row in either of the tables. That is, it returns all the rows from the left-hand table and all the rows from the right-hand table.
11. What is normalization?
Normalization is the process of minimizing redundancy and dependency by organizing the fields and tables of a database. The main aim of normalization is to ensure that additions, deletions, or modifications of a field can be made in just one table.
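To make the join types concrete, here is a small Python sketch using the standard-library sqlite3 module and two hypothetical tables (student and exam); SQLite supports INNER and LEFT joins directly:
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE exam (exam_id INTEGER PRIMARY KEY, student_id INTEGER, score INTEGER)")
cur.executemany("INSERT INTO student VALUES (?, ?)", [(1, "David"), (2, "Amy")])
cur.executemany("INSERT INTO exam VALUES (?, ?, ?)", [(10, 1, 85)])

# INNER JOIN: only students that have a matching exam row
print(cur.execute(
    "SELECT s.name, e.score FROM student s "
    "INNER JOIN exam e ON s.student_id = e.student_id").fetchall())   # [('David', 85)]

# LEFT JOIN: all students, with NULL (None) where there is no matching exam row
print(cur.execute(
    "SELECT s.name, e.score FROM student s "
    "LEFT JOIN exam e ON s.student_id = e.student_id").fetchall())    # [('David', 85), ('Amy', None)]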
• 60. SQL Interview QnA: Data Scientist- Job Material
12. What is denormalization?
Denormalization is a technique used to access data from higher to lower normal forms of a database. It is the process of deliberately introducing redundancy into a table by incorporating data from related tables.
14. What is a View?
A view is a virtual table consisting of a subset of the data contained in one or more tables. Views are not physically present, so they take less space to store. A view can combine data from one or more tables, depending on the relationships.
15. What is an Index?
An index is a performance tuning method that allows faster retrieval of records from a table. An index creates an entry for each value, which makes it faster to retrieve data.
16. What is a trigger?
A database trigger is code or a program that automatically executes in response to some event on a table or view in a database. Triggers mainly help to maintain the integrity of the database. Example: when a new student is added to the student database, new records should be created in related tables such as the Exam, Score, and Attendance tables.
• 61. SQL Interview QnA: Data Scientist- Job Material
What is the difference between the DELETE and TRUNCATE commands?
The DELETE command is used to remove rows from a table, and a WHERE clause can be used for a conditional set of parameters. COMMIT and ROLLBACK can be performed after a DELETE statement.
TRUNCATE removes all rows from the table, and the operation cannot be rolled back.
What are local and global variables and their differences?
Local variables exist and can be used only inside a function. They are not known to other functions, and they are created each time that function is called.
Global variables exist and can be used throughout the program. They are not re-created each time a function is called, and a function that declares a local variable with the same name will no longer refer to the global one.
What is the difference between the TRUNCATE and DROP statements?
TRUNCATE removes all the rows from a table, and it cannot be rolled back. DROP removes the table itself from the database, and the operation cannot be rolled back.
• 62. SQL Interview QnA: Data Scientist- Job Material
What are aggregate and scalar functions?
Aggregate functions evaluate a mathematical calculation over a set of values, typically calculated from a column of a table, and return a single value. Scalar functions return a single value based on a single input value.
Examples:
Aggregate – MAX(), COUNT() – calculated over numeric values.
Scalar – UCASE(), NOW() – calculated on individual values such as strings.
How can you create an empty table from an existing table?
An example would be:
SELECT * INTO studentcopy FROM student WHERE 1 = 2
Here, we are copying the student table into another table with the same structure, with no rows copied.
How do you fetch common records from two tables?
A common-records result set can be achieved with:
SELECT StudentID FROM student INTERSECT SELECT StudentID FROM Exam
• 63. SQL Interview QnA: Data Scientist- Job Material
How do you fetch alternate records from a table?
Records can be fetched for both odd and even row numbers.
To display even row numbers:
SELECT studentId FROM (SELECT rowno, studentId FROM student) WHERE MOD(rowno, 2) = 0
To display odd row numbers:
SELECT studentId FROM (SELECT rowno, studentId FROM student) WHERE MOD(rowno, 2) = 1
How do you select unique records from a table?
Select unique records from a table by using the DISTINCT keyword:
SELECT DISTINCT StudentID, StudentName FROM Student
What command is used to fetch the first 5 characters of a string?
There are many ways to fetch the first 5 characters of a string:
SELECT SUBSTRING(StudentName, 1, 5) AS studentname FROM student
SELECT LEFT(StudentName, 5) AS studentname FROM student
• 64. SQL Interview QnA: Data Scientist- Job Material
Which operator is used in a query for pattern matching?
The LIKE operator is used for pattern matching, and it can be used with:
1. % – matches zero or more characters.
2. _ (underscore) – matches exactly one character.
Examples:
SELECT * FROM Student WHERE studentname LIKE 'a%'
SELECT * FROM Student WHERE studentname LIKE 'ami_'
Summary
• Creating a database involves translating the logical database design model into the physical database.
• MySQL supports a number of data types for numeric, date, and string values.
• The CREATE DATABASE command is used to create a database.
• The CREATE TABLE command is used to create tables in a database.
• MySQL Workbench supports forward engineering, which involves automatically generating SQL scripts from the logical database model that can be executed to create the physical database.
  • 65. MS Excel Interview QnA: Data Scientist- Job Material
• 66. MS Excel Interview QnA: Data Scientist- Job Material
What are the common data formats in Microsoft Excel?
"The most common data formats used in Microsoft Excel are numbers, percentages, dates, and sometimes text (as in words and strings of text)."
How are these data formats used in Microsoft Excel?
"Numbers can be formatted in data cells as decimals or rounded values. Percentages show a part of a whole, the whole being 100%. Dates can change automatically depending on the region and location Microsoft Excel is used in. And the text format is used when analyses, reports, or other documents are entered into the Excel spreadsheet as data."
What are cell references?
"Cell references are used to refer to data located in the same Excel spreadsheet but in a different cell. There are three different cell reference types: absolute, relative, and mixed cell references."
What are the functions of the different cell references?
• 67. MS Excel Interview QnA: Data Scientist- Job Material
An absolute cell reference stays fixed on the cell it points to: no matter where the formula is copied, an absolute reference keeps pointing at the same cell. A relative cell reference shifts when the formula is copied to another cell. And a mixed cell reference fixes either the row or the column while the other part stays relative.
Which key or combination of keys allows you to toggle between absolute, relative, and mixed cell references?
A sample answer on the keys to press to switch between the cell references is: "On Windows devices the F4 key lets you change the cell reference type. On Mac devices the key combination Command + T allows this switch."
What is the function of the dollar sign ($) in Microsoft Excel?
"The dollar sign tells Excel whether or not to change the location of the reference when the formula is copied to other cells."
What is the LOOKUP function in Microsoft Excel?
"The LOOKUP function allows the user to find exact or partial matches in the spreadsheet. The VLOOKUP variant lets the user search for data arranged vertically, while HLOOKUP works the same way in the horizontal direction."
• 68. MS Excel Interview QnA: Data Scientist- Job Material
What is conditional formatting?
"Conditional formatting allows you to change the visual appearance of cells. For example, if you want all the cells that contain a value of 3 to be highlighted in yellow and made italic, conditional formatting lets you achieve this in only seconds."
What are the most important functions of Excel to you as a business analyst?
"I most often use the LOOKUP functions, followed by the COUNT and COUNTA functions. The IF, MAX, and MIN functions are also among the ones I usually use."
What does the COUNTA function do in an Excel spreadsheet?
"COUNTA scans all the rows and columns that contain data, counts the non-empty cells, and ignores the empty cells."
Can you import data from other software into an Excel spreadsheet?
"Importing data from various external data sources into an Excel spreadsheet is possible. Just go to the 'Data' tab in the toolbar and, by clicking the 'Get External Data' button, you will be able to import data from other software into Excel."
• 69. MS Excel Interview QnA: Data Scientist- Job Material
Why do you think knowledge of Microsoft Excel is important for a business analyst?
"Using Excel and dealing with the company's data is crucial because that data is often the only record the organization has, and it is in the hands of the business analyst to analyze it and come up with results and solutions for problems. The business analyst is a financial consultant as well as an analyst; you can become the person the CEO listens to in order to 'make' or 'break' certain deals."
As a business analyst, do you choose to store your sensitive data in a Microsoft Excel spreadsheet?
"Yes, I do store my clients' data in Microsoft Excel. However, if the data I am dealing with is confidential, then I would not store that sensitive data in an Excel file."
As a business analyst, how would you handle sensitive data in Microsoft Excel?
"Because I would be responsible for the transfer, possible disappearance, or leak of the data, I would store confidential data in software other than Microsoft Excel."
How can you protect your data in Microsoft Excel?
"From the Review tab, you can choose to protect your sheet with a password. That way the spreadsheet is password protected and cannot be opened or copied without the password."
  • 71. Python Interview QnA: Data Scientist- Job Material Q1. What built-in data types are used in Python? Python uses several built-in data types, including: Number (int, float and complex) String (str) Tuple (tuple) Range (range) List (list) Set (set) Dictionary (dict) In Python, data types are used to classify or categorize data, and every value has a data type Q2. How are data analysis libraries used in Python? What are some of the most common libraries? A key reason Python is such a popular data science programming language is because there is an extensive collection of data analysis libraries available. These libraries include functions, tools and methods for managing and analyzing data. There are Python libraries for performing a wide range of data science functions, including processing image and textual data, data mining and data visualization. The most widely used Python data analysis libraries include:
• 72. Python Interview QnA: Data Scientist- Job Material
o Pandas
o NumPy
o SciPy
o TensorFlow
o Scikit-learn
o Seaborn
o Matplotlib
Q3. How is a negative index used in Python?
Negative indexes are used in Python to access lists, arrays, and strings from the end, counting backwards. For example, index -1 refers to the last item in a list, while -2 refers to the second to last. Here's an example of a negative index in Python:
b = "Python Coding Fun"
print(b[-1])
>> n
• 73. Python Interview QnA: Data Scientist- Job Material
Q4. What's the difference between lists and tuples in Python?
Lists and tuples are classes in Python that store one or more objects or values. Key differences include:
Syntax – Lists are enclosed in square brackets and tuples are enclosed in parentheses.
Mutable vs. Immutable – Lists are mutable, which means they can be modified after being created. Tuples are immutable, which means they cannot be modified.
Operations – Lists have more functionality available than tuples, including insert and pop operations and sorting.
Size – Because tuples are immutable, they require less memory and are consequently faster.
Python Statistics Questions
Python statistics questions are based on implementing statistical analyses and test how well you know statistical concepts and can translate them into code. Often these questions take the form of random sampling from a distribution, generating histograms, and computing statistical metrics such as the standard deviation, mean, or median.
• 74. Python Interview QnA: Data Scientist- Job Material
Q5. Write a function to generate N samples from a normal distribution and plot them on a histogram.
This is a relatively simple problem: we have to set up the distribution, generate N samples from it, and plot them. NumPy (or SciPy, a library made for scientific computing) can be used for the sampling and Matplotlib for the plot.
Q6. Write a function that takes in a list of dictionaries with a key and a list of integers, and returns a dictionary with the standard deviation of each list. Note that this should be done without using the NumPy built-in functions.
Example: (a sketch for both questions is given below)
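A hedged sketch for Q5 and Q6; the exact input format for Q6 is not shown in the slides, so the record shape below ({'key': ..., 'values': [...]}) is an assumption:
import numpy as np
import matplotlib.pyplot as plt

def plot_normal_samples(n, mean=0.0, std=1.0):
    # Q5: draw n samples from N(mean, std^2) and plot a histogram
    samples = np.random.normal(loc=mean, scale=std, size=n)
    plt.hist(samples, bins=30)
    plt.title(f"{n} samples from a normal distribution")
    plt.show()
    return samples

def std_per_key(records):
    # Q6: population standard deviation per record, without NumPy
    out = {}
    for rec in records:
        vals = rec["values"]
        mean = sum(vals) / len(vals)
        variance = sum((v - mean) ** 2 for v in vals) / len(vals)
        out[rec["key"]] = variance ** 0.5
    return out

print(std_per_key([{"key": "a", "values": [1, 2, 3, 4]}]))  # {'a': 1.118033988749895}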
• 75. Python Interview QnA: Data Scientist- Job Material
Hint: Remember the equation for the standard deviation. To fulfill this function, take the sum of the squared differences between each data value and the mean, divide by the total number of data points, and take the square root of the whole thing. Does the expression inside the square root look familiar? It is the variance.
Q7. Given a list of stock prices in ascending order by datetime, write a function that outputs the max profit by buying and selling at a specific interval.
Example:
stock_prices = [10,5,20,32,25,12]
get_max_profit(stock_prices) -> 27
Making it harder: given a list of stock prices and datetimes in ascending order by datetime, write a function that outputs the profit and the start and end dates to buy and sell for max profit.
• 76. Python Interview QnA: Data Scientist- Job Material
stock_prices = [10,5,20,32,25,12]
dts = [
'2019-01-01',
'2019-01-02',
'2019-01-03',
'2019-01-04',
'2019-01-05',
'2019-01-06',
]
get_profit_dates(stock_prices, dts) -> (27, '2019-01-02', '2019-01-04')
Hint: There are several ways to solve this problem, but a good place to start is by thinking about your goal: if we want to maximize profit, ideally we would want to buy at the lowest price and sell at the highest price that comes after it.
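One possible single-pass solution sketch for Q7, assuming the buy must happen before the sell; the function names follow the slide's example calls:
def get_max_profit(prices):
    # Track the lowest price seen so far and the best profit achievable
    min_price = prices[0]
    best = 0
    for p in prices[1:]:
        best = max(best, p - min_price)
        min_price = min(min_price, p)
    return best

def get_profit_dates(prices, dates):
    min_idx, buy_idx, sell_idx, best = 0, 0, 0, 0
    for i in range(1, len(prices)):
        if prices[i] - prices[min_idx] > best:
            best = prices[i] - prices[min_idx]
            buy_idx, sell_idx = min_idx, i
        if prices[i] < prices[min_idx]:
            min_idx = i
    return best, dates[buy_idx], dates[sell_idx]

stock_prices = [10, 5, 20, 32, 25, 12]
dts = ['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04', '2019-01-05', '2019-01-06']
print(get_max_profit(stock_prices))          # 27
print(get_profit_dates(stock_prices, dts))   # (27, '2019-01-02', '2019-01-04')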
• 77. Python Interview QnA: Data Scientist- Job Material
Python Probability Questions
Most Python questions that involve probability test your knowledge of probability concepts. These questions are similar to the Python statistics questions, except they focus on simulating concepts like the binomial distribution or Bayes' theorem. Since most general probability questions are about calculating the chance of something under a certain condition, almost all of them can be checked by writing Python to simulate the problem.
Q8. Amy and Brad take turns rolling a fair six-sided die. Whoever rolls a "6" first wins the game. Amy starts by rolling first. What's the probability that Amy wins?
Given this scenario, we can write a Python function that simulates the game thousands of times to see how often the player who rolls first wins. Solving this problem requires simulating two players taking turns, with the same player rolling first each time.
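A minimal simulation sketch for Q8 (the trial count is arbitrary); the analytic answer is p / (1 − (1 − p)²) = 6/11 ≈ 0.545 with p = 1/6:
import random

def p_first_roller_wins(trials=100_000):
    wins = 0
    for _ in range(trials):
        amys_turn = True
        while True:
            if random.randint(1, 6) == 6:   # someone rolls a 6 and the game ends
                wins += amys_turn           # count it if it was Amy's turn
                break
            amys_turn = not amys_turn       # otherwise the turn passes
    return wins / trials

print(p_first_roller_wins())   # close to 6/11 ≈ 0.545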
• 79. Python Interview QnA: Data Scientist- Job Material
Name mutable and immutable objects.
Mutability is the ability to change part of a data structure without having to recreate it. Mutable objects include lists, sets, and the values in a dictionary. Immutability means the object cannot be changed after its creation. Immutable objects include integers, floats, booleans, strings, tuples, and the keys of a dictionary.
What are compound data types and data structures?
Data types that are constructed from simple, primitive, basic data types are compound data types. Data structures in Python allow us to store multiple observations; these are lists, tuples, sets, and dictionaries.
List:
Lists are enclosed in square brackets []
Lists are mutable, that is, their elements and size can be changed.
Lists are slower than tuples.
Example: ['A', 1, 'i']
Tuple:
Tuples are enclosed in parentheses ()
Tuples are immutable, i.e., they cannot be edited.
Tuples are faster than lists.
Tuples must be used when the order of the elements of a sequence matters.
Example: ('Twenty', 20, 'XX')
• 80. Python Interview QnA: Data Scientist- Job Material
What is the difference between a list and a tuple?
Lists are enclosed in square brackets [], are mutable (their elements and size can be changed), and are slower than tuples. Example: ['A', 1, 'i']. Tuples are enclosed in parentheses (), are immutable, and are faster than lists. Example: ('Twenty', 20, 'XX').
What is the difference between 'is' and '=='?
'==' checks for equality between the values of the variables, while 'is' checks for the identity of the variables (whether they are the same object).
What is the difference between indexing and slicing?
Indexing extracts or looks up one particular value in a data structure, whereas slicing retrieves a sequence of elements.
What is a lambda function?
Lambda functions are anonymous, or nameless, functions. They are called anonymous because they are not declared in the standard manner using the def keyword, and they don't require the return keyword; the return is implicit. The function can have any number of parameters but only one expression, whose value is returned; it cannot contain statements or multiple expressions. An anonymous function cannot be a direct call to print, because lambda requires an expression. Lambda functions have their own local namespace and cannot access variables other than those in their parameter list and those in the global namespace.
Example:
x = lambda i, j: i + j
print(x(7, 8))
Output: 15
• 81. Machine Learning: Data Scientist- Job Material
What is a default value?
A default argument means the function will take the default parameter value if the user has not supplied a value for that parameter.
What is the difference between lists and arrays?
An array is a data structure that contains a group of elements of the same data type, e.g., integer or string. The array elements share the same variable name, but each element has its own unique index number or key. The purpose is to organize the data so that a related set of values can easily be sorted or searched. Python lists, by contrast, can hold elements of different types.
How do I prepare for a Python interview?
There is no single way to prepare for a Python interview, but knowing the basics can never be discounted. It is necessary to know at least the following topics for Python interview questions for data science:
You must have command of basic control flow (for loops, while loops, if-elif-else statements) and be able to write it by hand.
A solid foundation in the various data types and data structures that Python offers: how, where, when, and why to use strings, lists, tuples, dictionaries, and sets, and how to iterate over each of these.
You must know how to use a list comprehension and a dictionary comprehension, and how to write a function.
Use of lambda functions, especially with map, reduce, and filter.
If needed, you must be able to discuss how you have used Python to solve common problems such as generating the Fibonacci series or Armstrong numbers.
Be thorough with Pandas and its various functions. Also be well versed in the libraries for visualization, scientific and computational work, and machine learning.
• 82. Python Interview QnA: Data Scientist- Job Material
What is Machine Learning?
Machine learning is making the computer learn from studying data and statistics. Machine learning is a step in the direction of artificial intelligence (AI): a program that analyses data and learns to predict the outcome.
Data Set
In the mind of a computer, a data set is any collection of data. It can be anything from an array to a complete database.
Example of an array: [99,86,87,88,111,86,103,87,94,78,77,85,86]
Numerical data are numbers, and can be split into two numerical categories:
Discrete data - numbers that are limited to integers. Example: the number of cars passing by.
Continuous data - numbers that can take any value. Example: the price of an item, or the size of an item.
• 83. Python Interview QnA: Data Scientist- Job Material
What are Mean, Median, and Mode?
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
Mean - the average value. To calculate the mean, find the sum of all values and divide the sum by the number of values: (99+86+87+88+111+86+103+87+94+78+77+85+86) / 13 = 89.77.
Median - the value in the middle after you have sorted all the values: 77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103, 111, so the median is 87.
Mode - the value that appears the most times: 86 appears three times, so the mode is 86.
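These three statistics can also be computed directly in code, in the same style as the later std and percentile examples; the standard-library statistics module is used here for the mode:
import numpy
from statistics import mode

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

print(numpy.mean(speed))     # 89.769...
print(numpy.median(speed))   # 87.0
print(mode(speed))           # 86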
• 84. Python Interview QnA: Data Scientist- Job Material
What is Standard Deviation?
Standard deviation is a number that describes how spread out the values are.
A low standard deviation means that most of the numbers are close to the mean (average) value. A high standard deviation means that the values are spread out over a wider range.
This time we have registered the speed of 7 cars:
speed = [86,87,88,86,87,85,86]
The standard deviation is 0.9, meaning that most of the values are within 0.9 of the mean value, which is 86.4.
import numpy
speed = [86,87,88,86,87,85,86]
x = numpy.std(speed)
print(x)
• 85. Python Interview QnA: Data Scientist- Job Material
Variance
Variance is another number that indicates how spread out the values are. In fact, if you take the square root of the variance, you get the standard deviation; or, the other way around, if you multiply the standard deviation by itself, you get the variance.
To calculate the variance you have to do as follows:
1. Find the mean: (32+111+138+28+59+77+97) / 7 = 77.4
2. For each value, find the difference from the mean (e.g., 32 - 77.4 = -45.4).
3. Square each difference (e.g., (-45.4)² = 2061.16).
4. The variance is the average of these squared differences: 1432.25.
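The same calculation with NumPy, in the style of the other examples (numpy.var returns the population variance by default):
import numpy
speed = [32,111,138,28,59,77,97]
x = numpy.var(speed)
print(x)   # approximately 1432.2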
• 86. Python Interview QnA: Data Scientist- Job Material
Standard Deviation
As we have learned, the formula to find the standard deviation is the square root of the variance: √1432.25 = 37.85
import numpy
speed = [32,111,138,28,59,77,97]
x = numpy.std(speed)
print(x)
What are Percentiles?
Percentiles are used in statistics to give you a number that describes the value that a given percent of the values are lower than.
Let's say we have an array of the ages of all the people that live on a street. What is the age that 90% of the people are younger than?
import numpy
ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
x = numpy.percentile(ages, 90)
print(x)
• 92. Machine Learning QnA: Data Scientist- Job Material
What is Machine Learning?
Machine Learning (ML) is the field of computer science with the help of which computer systems can make sense of data in much the same way human beings do. In simple words, ML is a type of artificial intelligence that extracts patterns out of raw data by using an algorithm or method. The main focus of ML is to allow computer systems to learn from experience without being explicitly programmed and without human intervention.
What are the different types of Machine Learning algorithms?
There are various types of machine learning algorithms. A broad categorization is based on whether they are trained with human supervision: supervised, unsupervised, and reinforcement learning.
• 93. Machine Learning QnA: Data Scientist- Job Material
What is Supervised Learning?
Supervised learning is a machine learning approach that infers a function from labeled training data. The training data consists of a set of training examples.
Example 1: Knowing the height and weight of a person, identify their gender. Popular supervised learning algorithms include:
• Support Vector Machines
• Regression
• Naive Bayes
• Decision Trees
• K-nearest Neighbour Algorithm and Neural Networks
Example 2: If you build a T-shirt classifier, the labels will be "this is an S, this is an M, and this is an L", based on showing the classifier examples of S, M, and L shirts.
• 94. Machine Learning QnA: Data Scientist- Job Material
What is Unsupervised Learning?
Unsupervised learning is a type of machine learning algorithm used to find patterns in a given set of data. Here we don't have any dependent variable or label to predict. Unsupervised learning algorithms include clustering, anomaly detection, neural networks, and latent variable models.
Example: In the same T-shirt example, clustering might group the shirts into categories like "collar style and V neck style", "crew neck style" and "sleeve types".
What is PCA? When do you use it?
Principal component analysis (PCA) is most commonly used for dimensionality reduction. PCA measures the variation along each direction in the data; directions with very little variation are dropped, so only the components that capture most of the variance are kept.
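A short PCA sketch, assuming scikit-learn and the built-in iris dataset (four features reduced to two components; dataset and parameters are illustrative):
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                                  # 150 rows, 4 original features
X_scaled = StandardScaler().fit_transform(X)          # PCA is sensitive to scale

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                                # (150, 2)
print(pca.explained_variance_ratio_)                  # share of variance kept by each component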
• 95. Machine Learning QnA: Data Scientist- Job Material What is Cross-Validation? Cross-validation is a method of repeatedly splitting your data into training and testing parts so that every observation is used for both. The data is split into k subsets, and the model is trained on k-1 of those subsets while the last subset is held out for testing. This is repeated for each of the subsets; this is k-fold cross-validation. Finally, the scores from all the k folds are averaged to produce the final score.
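A minimal k-fold cross-validation sketch with scikit-learn; the Iris data and logistic regression model are illustrative choices, not part of the original slides:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: train on 4 folds, score on the held-out fold, repeat for every fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)         # one accuracy score per fold
print(scores.mean())  # the averaged final score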
• 96. Machine Learning QnA: Data Scientist- Job Material What is Bias in Machine Learning? Bias in data tells us there is inconsistency in the data. The inconsistency may occur for several reasons, which are not mutually exclusive. For example, to speed up its hiring process, a tech giant like Amazon built an engine that would take in 100 resumes, spit out the top five, and hire those candidates. When the company realized the software was not producing gender-neutral results, it was tweaked to remove this bias. Explain the Difference Between Classification and Regression? Classification is used to produce discrete results; it classifies data into specific categories. For example, classifying emails into spam and non-spam categories. Regression, on the other hand, deals with continuous data. For example, predicting stock prices at a certain point in time. Classification is used to predict the output as one of a group of classes. For example, will it be hot or cold tomorrow? Regression is used to predict the relationship that the data represents. For example, what will the temperature be tomorrow?
• 97. Machine Learning QnA: Data Scientist- Job Material How to Tackle Overfitting and Underfitting? Overfitting means the model fits the training data too well; in this case, we need to resample the data and estimate the model accuracy using techniques like k-fold cross-validation. In the underfitting case, the model is not able to capture the patterns in the data; here we need to change the algorithm or feed more data points to the model. What is a Neural Network? It is a simplified model of the human brain. Much like the brain, it has neurons that activate when encountering something similar. The different neurons are connected via connections that help information flow from one neuron to another. How to Handle Outlier Values? An outlier is an observation in the dataset that is far away from the other observations. Tools used to discover outliers are: box plot, Z-score, scatter plot, etc. Typically, we follow three simple strategies to handle outliers: we can drop them; we can mark them as outliers and include that as a feature; or we can transform the feature to reduce the effect of the outlier.
• 98. Machine Learning QnA: Data Scientist- Job Material What is a Random Forest? How does it work? Random forest is a versatile machine learning method capable of performing both regression and classification tasks. Like bagging and boosting, random forest works by combining a set of other tree models. Random forest builds each tree from a random sample of the rows in the training data, considering a random subset of the columns at each split. Here are the steps a random forest uses to create its trees: o Take a sample from the training data. o Begin with a single node. o Run the following algorithm, starting from that node: • If the number of observations is less than the node size, stop. • Select random variables. • Find the variable that does the "best" job of splitting the observations. • Split the observations into two nodes. • Repeat these steps on each of the resulting nodes. What is Clustering? Clustering is the process of grouping a set of objects into a number of groups. Objects should be similar to one another within the same cluster and dissimilar to those in other clusters. A few types of clustering are: hierarchical clustering, K-means clustering, density-based clustering, fuzzy clustering, etc.
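For the random forest question above, a minimal scikit-learn sketch (the Iris data and parameter choices are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each tree is grown on a bootstrap sample of the rows, and each split
# considers only a random subset of the columns (max_features)
model = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out data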
• 99. Machine Learning QnA: Data Scientist- Job Material How can you select K for K-means Clustering? There are two kinds of methods: direct methods and statistical testing methods. o Direct methods: the elbow and silhouette methods. o Statistical testing methods: the gap statistic. The silhouette is the most frequently used when determining the optimal value of k. What are Recommender Systems? A recommendation engine is a system used to predict users' interests and recommend products that are quite likely to interest them. Data required for recommender systems stems from explicit user ratings after watching a film or listening to a song, from implicit search engine queries and purchase histories, or from other knowledge about the users/items themselves. Explain Correlation and Covariance? Correlation is used for measuring and estimating the quantitative relationship between two variables; it measures how strongly two variables are related. Examples include income and expenditure, demand and supply, etc. Covariance is a simple way to measure the relationship between two variables; the problem with covariance values is that they are hard to compare without normalization.
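For the K-selection question above, a minimal silhouette-based sketch; the synthetic blob data is an illustrative assumption:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 natural clusters, used only for illustration
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit K-means for several values of k and pick the k with the highest silhouette score
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, silhouette_score(X, labels))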
• 100. Machine Learning QnA: Data Scientist- Job Material What is P-value? P-values are used to make a decision about a hypothesis test. The p-value is the minimum significance level at which you can reject the null hypothesis. The lower the p-value, the more likely you are to reject the null hypothesis. What are Parametric and Non-Parametric Models? Parametric models have a limited, fixed number of parameters, and to predict new data you only need to know the parameters of the model. Non-parametric models place no limit on the number of parameters, allowing for more flexibility; to predict new data you need to know the model parameters as well as the state of the observed data. What is Reinforcement Learning? Reinforcement learning is different from the other types of learning, like supervised and unsupervised. In reinforcement learning, we are not given data with labels; learning is based on the rewards the environment gives to the agent.
• 101. Numpy and Pandas QnA: Data Scientist- Job Material Define Python Pandas. Pandas refers to a software library explicitly written for Python, which is used to analyze and manipulate data. Pandas can be installed using pip or the Anaconda distribution. Pandas makes it very easy to perform machine learning operations on tabular data. What Are The Different Types Of Data Structures In Pandas? The Pandas library supports two major data structures, DataFrames and Series. Both of these data structures are built on top of NumPy. A Series is one-dimensional and is the simplest data structure, while a DataFrame is two-dimensional. Another axis-labelled structure, the "Panel", is a 3-dimensional data structure with items such as major_axis and minor_axis (it has been deprecated and removed in recent versions of Pandas). Explain Series In Pandas. A Series is a one-dimensional array that can hold data values of any type (string, float, integer, Python objects, etc.). It is the simplest type of data structure in Pandas; here, the data's axis labels are called the index. Define Dataframe In Pandas. A DataFrame is a 2-dimensional array in which data is aligned in a tabular form with rows and columns. With this structure, you can perform arithmetic operations on rows and columns.
• 102. Numpy and Pandas QnA: Data Scientist- Job Material How Can You Create An Empty Dataframe In Pandas? To create an empty DataFrame in Pandas, type
import pandas as pd
ab = pd.DataFrame()
What Are The Most Important Features Of The Pandas Library? Important features of the Pandas library are: • Data alignment • Merge and join • Memory efficiency • Time series • Reshaping What are the different ways of creating DataFrame in pandas? Explain with examples. A DataFrame can be created using lists or a dict of ndarrays. How Will You Explain Re-indexing In Pandas? To re-index means to conform the data to match a particular set of labels along a particular axis. Various operations can be achieved using re-indexing, such as inserting missing value (NA) markers in label locations where no data for the label existed, and reordering the existing data to match a new set of labels. Create A Series Using Dict In Pandas.
import pandas as pd
ser = {'a': 1, 'b': 2, 'c': 3}
ans = pd.Series(ser)
print(ans)
• 103. Numpy and Pandas QnA: Data Scientist- Job Material What are the different ways of creating DataFrame in pandas? Explain with examples. A DataFrame can be created using lists or a dict of ndarrays. Example 1 – Creating a DataFrame using a list
import pandas as pd
# a list of strings
str_list = ['Pandas', 'NumPy']
# Calling the DataFrame constructor on the list
df = pd.DataFrame(str_list)
print(df)
Example 2 – Creating a DataFrame using a dict of arrays
import pandas as pd
data = {'ID': [1001, 1002, 1003], 'Department': ['Science', 'Commerce', 'Arts']}
df = pd.DataFrame(data)
print(df)
How Can You Iterate Over Dataframe In Pandas? To iterate over a DataFrame in pandas, a for loop can be used in combination with an iterrows() call. What Is Pandas Numpy Array? Numerical Python (NumPy) is defined as a package in Python to perform numerical computations and processing of multidimensional and single-dimensional array elements. NumPy arrays calculate faster than other Python arrays. How Can A Dataframe Be Converted To An Excel File? To write a single DataFrame to an Excel file, we can simply specify the target file's name. However, to write multiple sheets, we need to create an ExcelWriter object with the target filename and specify the sheet we wish to export.
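A minimal sketch of both cases; the file and sheet names are illustrative assumptions (writing .xlsx files also requires an engine such as openpyxl to be installed):

import pandas as pd

df1 = pd.DataFrame({'ID': [1001, 1002], 'Department': ['Science', 'Commerce']})
df2 = pd.DataFrame({'ID': [2001, 2002], 'Department': ['Arts', 'Maths']})

# Single sheet: just pass the target file name
df1.to_excel('single_sheet.xlsx', index=False)

# Multiple sheets: use an ExcelWriter and name each sheet explicitly
with pd.ExcelWriter('multiple_sheets.xlsx') as writer:
    df1.to_excel(writer, sheet_name='First', index=False)
    df2.to_excel(writer, sheet_name='Second', index=False)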
• 105. Numpy and Pandas QnA: Data Scientist- Job Material What is the main use of NumPy? NumPy can be used to perform a wide variety of mathematical operations on arrays. It adds powerful data structures to Python that guarantee efficient calculations with arrays and matrices, and it supplies an enormous library of high-level mathematical functions that operate on these arrays and matrices. Why is NumPy faster than a list? NumPy arrays are faster than Python lists because an array is a collection of homogeneous data types stored in contiguous memory locations, whereas a list in Python is a collection of heterogeneous data types stored in non-contiguous memory locations. What is the difference between NumPy and Pandas? NumPy provides objects for multi-dimensional arrays, whereas Pandas offers an in-memory 2D table object called a DataFrame. NumPy consumes less memory than Pandas, and indexing of Pandas Series objects is slower than indexing NumPy arrays.
• 106. Numpy and Pandas QnA: Data Scientist- Job Material List the steps to create a 1D array and 2D array A one-dimensional array is created as follows:
import numpy as np
num = [1, 2, 3]
num = np.array(num)
print("1d array : ", num)
A two-dimensional array is created as follows:
num2 = [[1, 2, 3], [4, 5, 6]]
num2 = np.array(num2)
print("\n2d array : ", num2)
How do you create a 3D array? A three-dimensional array is created as follows:
num3 = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
num3 = np.array(num3)
print("\n3d array : ", num3)
What is the procedure to find the indices of an array in NumPy where some condition is true? You may use the function numpy.nonzero() to find the indices of an array. You can also use the nonzero() method to do so. In the following program, we take an array a, where the condition is a > 3. The comparison returns a boolean array, and False in Python and NumPy is denoted as 0. Therefore, np.nonzero(a > 3) will return the indices of the array a where the condition is True.
>>> import numpy as np
>>> a = np.array([[1,2,3],[4,5,6],[7,8,9]])
>>> a > 3
array([[False, False, False], [ True, True, True], [ True, True, True]], dtype=bool)
>>> np.nonzero(a > 3)
(array([1, 1, 1, 2, 2, 2]), array([0, 1, 2, 0, 1, 2]))
You can also call the nonzero() method of the boolean array.
>>> (a > 3).nonzero()
(array([1, 1, 1, 2, 2, 2]), array([0, 1, 2, 0, 1, 2]))
• 109. Data Visualization Data Scientist- Job Material What is the difference between various BI tools and Tableau? The basic difference between traditional BI tools and Tableau lies in efficiency and speed. The architecture of traditional BI tools has hardware limitations, while Tableau does not have any such dependencies. Traditional BI tools work on complex technologies, while Tableau uses simple associative search to make analysis dynamic. Traditional BI tools do not support multi-thread, in-memory, or multi-core computing, while Tableau supports all of these features. Traditional BI tools have a pre-defined data view, while Tableau lets you perform predictive analysis for business operations. What are different Tableau products? Tableau, like other BI tools, has a range of products: Tableau Desktop: The Desktop product is used to create optimized queries from pictures of data. Once the queries are ready, you can perform them without the need to code. Tableau Desktop brings data from various sources into its data engine and creates interactive dashboards. Tableau Server: When you have published dashboards using Tableau Desktop, Tableau Server helps in sharing them throughout the organization. It is an enterprise-level product that is installed on a Windows or Linux server.
• 110. Data Visualization Data Scientist- Job Material Tableau Reader: Tableau Reader is a free desktop application that lets you open and view data visualizations. You can filter or drill down into the data, but you cannot edit any formulas or perform any other actions on it. It is also used to extract connection files. Tableau Online: Tableau Online is a paid offering but doesn't need a dedicated installation. It comes with the software and is used to share published dashboards anywhere and everywhere. Tableau Public: Tableau Public is yet another free option to share your data visualizations by saving them as worksheets or workbooks on the Tableau Public server. What is a parameter in Tableau? A parameter is a variable (number, string, or date) created to replace a constant value in calculations, filters, or reference lines. For example, you create a field that returns true if the sales are greater than 30,000 and false otherwise. A parameter can be used in place of this number (30,000 in this case) so it can be set dynamically during calculations. Parameters allow you to dynamically modify values in a calculation. Parameters can accept values through the following options: All: simple text field List: list of possible values to select from Range: select values from a specified range
• 111. Data Visualization Data Scientist- Job Material Tell me something about measures and dimensions? In Tableau, when we connect to a new data source, each field in the data source is mapped as either a measure or a dimension. These fields are the columns defined in the data source. Each field is assigned a data type (integer, string, etc.) and a role (discrete dimension or continuous measure). Measures contain numeric values that are analyzed against a dimension table. Measures are stored in a table that allows storage of multiple records and contains foreign keys referring uniquely to the associated dimension tables. Dimensions, by contrast, contain qualitative values (names, dates, geographical data) that define comprehensive attributes used to categorize, segment, and reveal the details of the data. What are continuous and discrete field types? Tableau's specialty lies in displaying data either in a continuous format or a discrete one. Both are mathematical terms: continuous means without interruptions, while discrete means individually separate and distinct. Blue indicates discrete behavior, while green indicates continuous behavior. The discrete view defines headers and can be easily sorted, while a continuous field defines the axis in a graph view and cannot be sorted.
• 112. Data Visualization Data Scientist- Job Material What is aggregation and disaggregation of data? Aggregation of data means displaying the measures and dimensions in an aggregated form. The aggregate functions available in Tableau are: SUM (expression): Adds up all the values used in the expression. Used only for numeric values. AVG (expression): Calculates the average of all the values used in the expression. Used only for numeric values. MEDIAN (expression): Calculates the median of all the values across all the records used in the expression. Used only for numeric values. COUNT (expression): Returns the number of values in the set of expressions. Excludes null values. COUNTD (expression): Returns the number of unique values in the set of expressions. Disaggregation of data means displaying each and every record separately. Tell me the different connections to make with a dataset? There are two types of data connections in Tableau: LIVE: A live connection is a dynamic way to work with real-time data by connecting directly to the data source. Tableau creates queries directly against the database and retrieves the query results into a workbook. EXTRACT: An extract is a snapshot of the data; the extract file (.tde or .hyper) contains data pulled from a relational database or a static source such as an Excel spreadsheet. You can schedule refreshes of these snapshots using Tableau Server, so no permanent connection to the database is needed.
• 113. Data Visualization Data Scientist- Job Material What are the different types of joins in Tableau? Joins in Tableau are pretty similar to SQL joins: • Left Outer Join: Extracts all the records from the left table and the matching rows from the right table. • Right Outer Join: Extracts all the records from the right table and the matching rows from the left table. • Full Outer Join: Extracts the records from both the left and right tables; all unmatched rows get NULL values. • Inner Join: Extracts only the records that match in both tables.
• 115. Django Data Scientist- Job Material What are Django's most prominent features? Programmers like Django mostly for its convenient features: • Optimized for SEO • Extremely fast • A loaded framework that features authentication, content administration and RSS feeds • Exceptionally scalable to meet the heaviest traffic demands • Highly secure • Versatile, enabling you to create many different types of websites Can you name some companies that use Django? Some of the more well-known companies that use Django include: • Disqus • Instagram • Mozilla • Pinterest • Reddit • YouTube
• 116. Django Data Scientist- Job Material Why do web developers prefer Django? Web developers use Django because it: • Allows code modules to be divided into logical groups, making them flexible to change • Provides an auto-generated web admin module to ease website administration • Provides a pre-packaged API for common user tasks • Enables developers to define a given function's URL • Allows users to separate business logic from the HTML • Is written in Python, one of the most popular programming languages available today • Gives you a system to define the HTML template for your web page, avoiding code duplication What is CRUD? It has nothing to do with dirt or grime. It's a handy acronym for Create, Read, Update, and Delete. It's a mnemonic framework used to remind developers how to construct usable models when building application programming interfaces (APIs).
• 117. Django Data Scientist- Job Material What does Django architecture look like? Django architecture consists of: • Models. Describe the database schema and data structure • Views. Control what a user sees. The view retrieves data from the appropriate models, executes any calculations, and passes it on to the template • Templates. Control how the user sees the pages. A template describes how the data received from the views should be altered or formatted for display on the page • Controller. Made up of the Django framework and URL parsing In Django's context, what's the difference between a project and an app? The project covers the entire application, while an app is a module or application within the project that deals with one dedicated requirement. So, a project consists of several apps, and an app can feature in multiple projects.
• 118. Django Data Scientist- Job Material What's a model in Django? A model consists of all the necessary fields and attributes of your stored data. Models are a single, definitive source of information about your data. What are Django's templates? Django templates render information in a designer-friendly format to present to the user. Using the Django Template Language (DTL), a user can generate HTML dynamically. Django templates consist of simple text files that can create any text-based format such as XML, CSV, and HTML. Discuss Django's Request/Response Cycle. Starting the process off, the Django server receives a request. The server then looks for a matching URL in the URL patterns defined for the project. If the server can't find a matching URL, it produces a 404 status code. If the URL matches, it executes the corresponding code in the view file associated with the URL and sends a response.
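To tie together the model, template, and view ideas above, here is a minimal two-file sketch; the app, field, and template names are illustrative assumptions, not taken from the original project:

# models.py -- the single, definitive description of the stored data
from django.db import models

class Product(models.Model):
    name = models.CharField(max_length=100)
    price = models.DecimalField(max_digits=8, decimal_places=2)

# views.py -- the view pulls data from the model and hands it to a template
from django.shortcuts import render
from .models import Product

def product_list(request):
    products = Product.objects.all()
    return render(request, "store/product_list.html", {"products": products})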
• 119. Django Data Scientist- Job Material What is the Django Admin interface? Django comes equipped with a fully customizable, built-in admin interface. This portal lets developers see and make changes to all the data residing in the database for registered apps and models. A model must be registered in the admin.py file to use its database table with the admin interface. How do you install Django? Users download and install Python per the operating system used by the host machine. Then run the command pip install "django>=2.2,<3" on the terminal and wait for the installation to finish. How do you check which version of Django you have installed on your system? You can check the version by opening the command prompt and entering the command: python -m django --version What are signals in Django? Signals are pieces of code containing information about what is currently going on. A dispatcher is used to both send and listen for signals.
• 120. Django Data Scientist- Job Material What is the Django Rest Framework? The Django Rest Framework (DRF) is a framework that helps you quickly create RESTful APIs. They are ideal for web applications due to low bandwidth utilization. What do you use middleware for in Django? You use middleware for four different functions: • Content Gzipping • Cross-site request forgery protection • Session management • User authentication What does a URLconf file contain? The URLconf file in Django contains a list of URL patterns and mappings to the view functions for those URLs. The URLs can map to view functions, class-based views, and the URLconf of other applications. Does Django support multiple-column primary keys? No, Django supports only single-column primary keys.
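For the URLconf question above, a minimal urls.py sketch; the paths, view names, and the included cart app are illustrative assumptions:

# urls.py -- maps URL patterns to views
from django.urls import path, include
from . import views

urlpatterns = [
    path("", views.home, name="home"),
    path("products/<int:product_id>/", views.product_detail, name="product_detail"),
    path("cart/", include("cart.urls")),  # pull in another app's URLconf
]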
  • 121. Data Science Mock Interview Data Scientist- Job Material What is the difference between data science and big data?
• 122. Data Science Mock Interview Data Scientist- Job Material What are Recommender Systems? Ans. Recommender systems are a subclass of information filtering systems, used to predict how users would rate or score particular objects (movies, music, merchandise, etc.). Recommender systems filter large volumes of information based on the data provided by a user and other factors, and they take care of the user's preferences and interests. Recommender systems utilize algorithms that optimize the analysis of the data to build the recommendations. They ensure a high level of efficiency as they can associate elements of our consumption profiles, such as purchase history, content selection, and even our hours of activity, to make accurate recommendations. What are the different types of Recommender Systems? Ans. There are three main types of recommender systems: user-user collaborative filtering, item-item collaborative filtering, and content-based filtering.
• 123. Data Science Mock Interview Data Scientist- Job Material Differentiate between wide and long data formats. In the wide format, each subject's repeated measurements sit in a single row, with each measurement in its own column. In the long format, each row holds one observation per subject, so a subject's data spans multiple rows. What are Interpolation and Extrapolation? Interpolation – This is the method of estimating data points between known data points; it is a prediction between the given data points. Extrapolation – This is the method of estimating data points beyond the known data; it is a prediction beyond the given data points. How much data is enough to get a valid outcome? Ans. All businesses are different and are measured in different ways. Thus, you never have enough data, and there is no single right answer. The amount of data required depends on the methods you use to have an excellent chance of obtaining vital results. What is the difference between 'expected value' and 'average value'? Ans. When it comes to functionality, there is no difference between the two. However, they are used in different situations. An expected value usually refers to random variables, while the average value refers to a population sample.
• 124. Data Science Mock Interview Data Scientist- Job Material What happens if two users access the same HDFS file at the same time? Ans. This is a bit of a tricky question. The answer itself is not complicated, but it is easy to be confused by the similarity of the programs' reactions. When the first user is accessing the file, the second user's inputs will be rejected because HDFS NameNode supports exclusive write. What is the importance of statistics in data science? Ans. Statistics help data scientists to get a better idea of a customer's expectations. Using statistical methods, data scientists can acquire knowledge about consumer interest, behavior, engagement, retention, etc. It also helps to build robust data models to validate certain inferences and predictions. What are the different statistical techniques used in data science? There are many statistical techniques used in data science, including: • The arithmetic mean – a measure of the average of a set of data • Graphic display – charts and graphs to visually display, analyze, clarify, and interpret numerical data through histograms, pie charts, bars, etc. • Correlation – establishes and measures relationships between different variables • Regression – allows identifying if the evolution of one variable affects others
• 125. • Time series – predicts future values by analyzing sequences of past values • Data mining and other Big Data techniques to process large volumes of data • Sentiment analysis – determines the attitude of specific agents or people towards an issue, often using data from social networks • Semantic analysis – helps to extract knowledge from large amounts of text • A/B testing – determines which of two variants works best, using randomized experiments • Machine learning – uses automatic learning algorithms to ensure excellent performance in the presence of big data What is an RDBMS? Name some examples of RDBMS. A relational database management system (RDBMS) is a database management system that is based on the relational model. Some examples of RDBMS are MS SQL Server, IBM DB2, Oracle, MySQL, and Microsoft Access. What are a Z test, Chi-Square test, F test, and T-test? The Z-test is applied for large samples: Z = (estimated mean – real mean) / (real standard deviation / √n). The Chi-Square test is a statistical method assessing the goodness of fit between a set of observed values and those expected theoretically. The F-test is used to compare the variances of two populations: F = explained variance / unexplained variance. The T-test is applied for small samples: T = (estimated mean – real mean) / (estimated standard deviation / √n). What does P-value signify about the statistical data? The p-value is the probability, for a given statistical model and assuming the null hypothesis is true, that the statistical summary would be the same as or more extreme than the actual observed results.
• 126. When the p-value > 0.05, it denotes weak evidence against the null hypothesis, which means the null hypothesis cannot be rejected. A p-value <= 0.05 denotes strong evidence against the null hypothesis, which means the null hypothesis can be rejected. A p-value = 0.05 is the marginal value, indicating it is possible to go either way. Differentiate between univariate, bivariate, and multivariate analysis. Univariate analysis is the simplest form of statistical analysis, where only one variable is involved. Bivariate analysis is where two variables are analyzed, and in multivariate analysis, multiple variables are examined. What is association analysis? Where is it used? Association analysis is the task of uncovering relationships among data. It is used to understand how the data items are associated with each other. What is the difference between squared error and absolute error? Squared error measures the average of the squares of the errors or deviations—that is, the difference between the estimator and what is estimated. Absolute error is the difference between the measured or inferred value of a quantity and its actual value. What is an API? What are APIs used for? API stands for Application Program Interface and is a set of routines, protocols, and tools for building software applications. With an API, it is easier to develop software applications.
• 127. What is Collaborative filtering? Collaborative filtering is a method of making automatic predictions by using the recommendations of other people. What is market basket analysis? Market basket analysis is a modeling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items. What is the central limit theorem? The central limit theorem states that the distribution of an average will tend to be normal as the sample size increases, regardless of the distribution from which the average is taken, except when the moments of the parent distribution do not exist. Explain the difference between type I and type II errors. A type I error is the rejection of a true null hypothesis, or a false-positive finding, while a type II error is the non-rejection of a false null hypothesis, or a false-negative finding. What is Linear Regression? This is one of the most commonly asked interview questions. Linear regression is the most popular type of predictive analysis. It is used to model the relationship between a scalar response and explanatory variables.
• 128. What are the limitations of a Linear Model/Regression? Linear models are limited to linear relationships between the dependent and independent variables. Linear regression looks at the relationship between the mean of the dependent variable and the independent variables, not the extremes of the dependent variable. Linear regression is sensitive to univariate and multivariate outliers. Linear regression tends to assume that the observations are independent. What is the goal of A/B Testing? Ans. A/B testing is a comparative study, where two or more variants of a page are presented to random users and their feedback is statistically analyzed to check which variation performs better. What is the main difference between overfitting and underfitting? Ans. Overfitting – In overfitting, a statistical model describes random error or noise, and it occurs when a model is overly complex. An overfit model has poor predictive performance as it overreacts to minor fluctuations in the training data. Underfitting – In underfitting, a statistical model is unable to capture the underlying trend of the data. This type of model also shows poor predictive performance. What is a Gaussian distribution and how is it used in data science? Ans. The Gaussian distribution, commonly known as the bell curve, is a common probability distribution.
• 129. Explain the purpose of group functions in SQL. Cite certain examples of group functions. Group functions provide summary statistics of a data set. Some examples of group functions are: a) COUNT b) MAX c) MIN d) AVG e) SUM f) DISTINCT What is the difference between a Validation Set and a Test Set? The validation set is used to minimize overfitting. It is used in parameter selection, which means that it helps to verify any accuracy improvement over the training data set. The test set is used to test and evaluate the performance of a trained machine learning model. What is the p-value? Ans. A p-value helps to determine the strength of the results in a hypothesis test. It is a number between 0 and 1, and its value determines the strength of the results. What do you mean by logistic regression? Also known as the logit model, logistic regression is a technique to predict a binary outcome from a linear combination of predictor variables.
• 130. What is Correlation Analysis? Ans. Correlation analysis is a statistical method to evaluate the strength of the relationship between two quantitative variables. Autocorrelation coefficients can also be estimated for spatial relationships, where data are correlated based on distance. What is the difference between a bar graph and a histogram? In bar charts, each column represents a group defined by a categorical variable, while in histograms, each column represents a group defined by a quantitative variable. Which technique is used to predict categorical responses? Classification techniques are used to predict categorical responses. What libraries do data scientists use to plot data in Python? Matplotlib is the main library used to plot data in Python. However, graphics created with this library need a lot of tweaking to make them look bright and professional. For that reason, many data scientists prefer Seaborn, which allows you to create attractive and meaningful charts with just one line of code.
• 131. What are lambda functions? Ans. Lambda functions are anonymous functions in Python. They are very useful when you need to define a function that is very short and consists of a single expression. So instead of formally defining a little function with a specific name, body, and return statement, you can write everything in one short line of code using a lambda function. What is TensorFlow? The Google Brain team developed TensorFlow, a free, open-source library designed for numerical computation using data flow graphs. What is PyTorch? PyTorch is a Python-based scientific computing package designed to perform numerical calculations using tensor programming. PyTorch integrates seamlessly with Python and its popular libraries like NumPy, and it is easier to learn than other deep learning frameworks. What packages are used for data mining in Python and R? Ans. There are various packages in Python and R: • Python – Orange, Pandas, NLTK, Matplotlib, and Scikit-learn are some of them. • R – Arules, tm, Forecast, and GGPlot are some of the packages.
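Returning to the lambda question above, a tiny illustration; the names and data are made up for the example:

# A named function and an equivalent single-expression lambda
def square(x):
    return x * x

square_lambda = lambda x: x * x

# Lambdas are handy inline, for example as a sort key
people = [("Asha", 31), ("Ravi", 25), ("Mira", 28)]
print(sorted(people, key=lambda person: person[1]))  # sort by age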
• 132. Explain the difference between lists and tuples. Both lists and tuples are made up of elements, which are values of any Python data type. However, these data types have a number of differences: • Lists are mutable, while tuples are immutable. • Lists are created with square brackets (for example, my_list = [a, b, c]), while tuples use parentheses (for example, my_tuple = (a, b, c)). • Lists are slower than tuples. What is hypothesis testing? Hypothesis testing is an important aspect of any testing procedure in machine learning or data science; it is used to analyze the various factors that may have an impact on the outcome of an experiment. What is Pattern Recognition? Ans. Pattern recognition is the process of data classification that includes recognizing patterns and identifying regularities in data. This methodology involves the extensive use of machine learning algorithms. When do you need to update the algorithm in Data Science? Ans. You need to update an algorithm in the following situations: you want your data model to evolve as data streams through the infrastructure; the underlying data source is changing; or the data is non-stationary.
• 133. Why should you perform dimensionality reduction before fitting an SVM? SVMs tend to perform better in a reduced feature space. If the number of features is large compared to the number of observations, we should perform dimensionality reduction before fitting an SVM. What is the Hierarchical Clustering Algorithm? Ans. The hierarchical clustering algorithm merges and divides existing groups, creating a hierarchical structure that shows the order in which groups are split or merged. Have you contributed to any open source project? You should say specifically which projects you have worked on and what their objectives were. A good answer would also include what you have learned from participating in open source projects. What is a decision tree method? Ans. The decision tree method is an analytical method that facilitates better decision making through a schematic representation of the available alternatives. Decision trees are very helpful when there are risks, costs, benefits, and multiple options involved. The name is derived from the model's tree-like appearance, and it is widely used in the field of decision making under uncertainty (decision theory). Why is natural language processing important? Ans. NLP helps computers communicate with humans in their own language and scales other language-related tasks. It contributes towards structuring highly unstructured data sources.
  • 134. o Content categorization – Generate a linguistics-based summary of the document, including search and indexing, content alerts, and duplication detection. o Sentiment analysis – Identification of mood or subjective opinions in large amounts of text, including sentiment mining and average opinions. o Speech-to-text and text-to-speech conversion – Transformation of voice commands into written text and vice versa. o Document summarization – Automatic generation of synopses of large bodies of text. o Machine-based translation – Automatic translation of text or speech from one language to another. What is the importance of the decision tree method? The decision tree method mitigates the risks of unforeseen consequences and allows you to include smaller details that will lead you to create a step-by-step plan. Once you choose your path, you only need to follow it. Broadly speaking, this is a perfect technique for – o Analyzing problems from different perspectives o Evaluating all possible solutions o Estimating the business costs of each decision o Making reasoned decisions with real and existing information about any company o Analyzing alternatives and probabilities that result in the success of a business What is the main difference between supervised and unsupervised machine learning? Ans. Supervised learning includes training labeled data for a range of tasks such as data classification, while unsupervised learning does not require explicitly labeling data.
• 135. What is data visualization? Data visualization is the process of presenting datasets and other information through visual mediums like charts, graphs, and others. It enables the user to detect patterns, trends, and correlations that might otherwise go unnoticed in traditional reports, tables, or spreadsheets. What is KNN? K-Nearest Neighbours, or KNN, is a simple machine learning algorithm based on the supervised learning method. It assumes similarity between the new case and the available cases, and puts the new case into the category it is closest to. What is Gradient Descent? Gradient descent is a popular algorithm used for training machine learning models; it finds the values of the parameters of a function (f) that minimize a cost function. Before you appear for your data science interview, make sure you have: researched the role and the related skills required for it; brushed up your learning, read through the concepts, and gone through the projects you have worked on; participated in mock interviews to prepare yourself better; and reviewed your past experience and achievements and made a gist of those.
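For the gradient descent question above, a minimal from-scratch sketch that fits a straight line by minimizing mean squared error; the data points and learning rate are illustrative assumptions:

import numpy as np

# Illustrative data, roughly following y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

w, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    y_pred = w * x + b
    # Gradients of the mean squared error with respect to w and b
    dw = (-2 / len(x)) * np.sum(x * (y - y_pred))
    db = (-2 / len(x)) * np.sum(y - y_pred)
    # Step against the gradient to reduce the cost
    w -= lr * dw
    b -= lr * db

print(w, b)  # should approach roughly 2 and 1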
• 136. Projects for Interview Data Scientist- Job Material CASE 1: The best way to find out whether a hotel is right for you is to find out what people who have stayed there before are saying about it. Now, it is very difficult to read the experience of every person who has given their opinion on the services of the hotel. This is where the task of sentiment analysis comes in. Here, I will walk you through the task of Hotel Reviews Sentiment Analysis with Python. Hotel Reviews Sentiment Analysis with Python The dataset that I am using for the task of hotel reviews sentiment analysis is collected from Kaggle. It contains about 20,000 reviews from people about the services of hotels they stayed in for a vacation, a business trip, or any other kind of trip. This dataset only contains two columns: the reviews and the ratings of the customers. So let's get started with the task of hotel reviews sentiment analysis with Python by importing the necessary Python libraries and the dataset:
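The code from the original slide is image-only in this extract; a minimal loading sketch, assuming the Kaggle export is saved as hotel_reviews.csv with Review and Rating columns:

import pandas as pd

# File and column names are assumptions about the Kaggle export
data = pd.read_csv("hotel_reviews.csv")
print(data.head())
print(data.isnull().sum())  # confirm there are no missing values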
• 137. Projects for Interview Data Scientist- Job Material CASE 1: This dataset is very large and, luckily, there are no missing values, so without wasting any time let's take a quick look at the distribution of customer ratings:
  • 138. Projects for Interview Data Scientist- Job Material CASE 1:
• 139. Projects for Interview Data Scientist- Job Material CASE 1: According to the reviews, hotel guests seem satisfied with the services. Now let's take a look at how most people feel about the hotel services based on the sentiment of their reviews. Thus, most people feel neutral about the hotel services. Now let's take a closer look at the sentiment scores:
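One way to reproduce such sentiment scores is with NLTK's VADER analyzer; the file and column names below are assumptions carried over from the loading sketch above:

import nltk
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")

data = pd.read_csv("hotel_reviews.csv")  # assumed file name
sentiments = SentimentIntensityAnalyzer()

# Score each review and keep the positive, negative and neutral components
data["Positive"] = [sentiments.polarity_scores(str(r))["pos"] for r in data["Review"]]
data["Negative"] = [sentiments.polarity_scores(str(r))["neg"] for r in data["Review"]]
data["Neutral"] = [sentiments.polarity_scores(str(r))["neu"] for r in data["Review"]]
print(data.head())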
• 140. Projects for Interview Data Scientist- Job Material CASE 1: Thus, according to the above results, more than 12,000 reviews are classified as neutral and more than 6,000 reviews are classified as positive. So it can be said that people are really happy with the services of the hotels they have stayed in, as the negative reviews number below 1,500. Summary This is how you can analyze the sentiments of hotel reviews. The best way to know if a hotel is right for you is to find out what people who have stayed there before are saying about it. This is where the task of hotel reviews sentiment analysis can help you decide whether or not a hotel is suitable for your trip. Hope you liked this article on sentiment analysis of hotel reviews with Python.
  • 141. Projects for Interview Data Scientist- Job Material CASE 2: In this Data Science Project we will create a Linear Regression model and a Decision Tree Regression Model to Predict Apple’s Stock Price using Machine Learning and Python.
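The implementation slides for this case are image-only in this extract; a minimal sketch of the idea, assuming a daily AAPL price history saved as AAPL.csv with a Close column (the file name, column name, and 25-day horizon are assumptions):

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("AAPL.csv")[["Close"]]

# Predict the closing price 25 trading days into the future
future_days = 25
df["Prediction"] = df["Close"].shift(-future_days)
X = df.drop("Prediction", axis=1).values[:-future_days]
y = df["Prediction"].values[:-future_days]

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

tree = DecisionTreeRegressor().fit(x_train, y_train)
linear = LinearRegression().fit(x_train, y_train)
print("Decision tree R^2:", tree.score(x_test, y_test))
print("Linear regression R^2:", linear.score(x_test, y_test))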
• 146. Projects for Interview Data Scientist- Job Material CASE 2: Customer Personality Analysis with Python Customer personality analysis helps a business modify its product based on its target customers from different customer segments. For example, instead of spending money to market a new product to every customer in the company's database, a company can analyze which customer segment is most likely to buy the product and then market the product only to that particular segment. The most important part of a customer personality analysis is getting the answers to questions such as: What people say about your product: what is the customers' attitude towards the product? What people do: what are people actually doing, rather than what they are saying about your product? I'll walk you through a data science project on analyzing customer personality with Python. Here I will be using a dataset that contains data collected from a marketing campaign, where our task is to predict how different customer segments will respond to a particular product or service.
• 147. Projects for Interview Data Scientist- Job Material CASE 2: Customer Personality Analysis with Python Now let's start with the task of customer personality analysis with Python. Since this is a segmentation task, we will use clustering to summarize customer segments, and then we will also use the Apriori algorithm. Let's begin by importing the necessary Python libraries and the dataset:
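The loading code itself is image-only in this extract; a minimal sketch, assuming the Kaggle marketing-campaign export is saved as marketing_campaign.csv and is tab-separated:

import pandas as pd

# File name and separator are assumptions about the Kaggle export
data = pd.read_csv("marketing_campaign.csv", sep="\t")
print(data.head())
print(data.info())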
  • 148. Projects for Interview Data Scientist- Job Material CASE 2: Customer Personality Analysis with Python
  • 149. Projects for Interview Data Scientist- Job Material CASE 2: Customer Personality Analysis with Python
  • 150. Projects for Interview Data Scientist- Job Material CASE 2: Customer Personality Analysis with Python
  • 151. Projects for Interview Data Scientist- Job Material CASE 2: Customer Personality Analysis with Python
  • 152. Projects for Interview Data Scientist- Job Material CASE 2: Customer Personality Analysis with Python
• 153. Projects for Interview Data Scientist- Job Material CASE 2: Customer Personality Analysis with Python In the code section below, I will first normalize the data and then create customer clusters according to the metrics defined above:
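A minimal clustering sketch for this step; the Age and Income features, the Year_Birth column, and the choice of four clusters are assumptions (the original walkthrough also derives a Seniority feature):

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("marketing_campaign.csv", sep="\t")  # assumed file name
data["Age"] = 2024 - data["Year_Birth"]  # Year_Birth column assumed from the Kaggle file

features = data[["Age", "Income"]].dropna()

# Normalize the features, then group customers into four segments
scaled = StandardScaler().fit_transform(features)
features["Cluster"] = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(scaled)
print(features["Cluster"].value_counts())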
  • 154. Projects for Interview Data Scientist- Job Material CASE 2: Customer Personality Analysis with Python
  • 155. Projects for Interview Data Scientist- Job Material CASE 2: Customer Personality Analysis with Python
  • 156. Projects for Interview Data Scientist- Job Material CASE 2: Customer Personality Analysis with Python
  • 157. Projects for Interview Data Scientist- Job Material CASE 2: Customer Personality Analysis with Python Data Preparation for Customer Personality Analysis Now I will prepare the data for the Apriori algorithm. Here I will be defining three segments of the customers according to the age, income and seniority:
  • 158. Projects for Interview Data Scientist- Job Material CASE 2: Customer Personality Analysis with Python
• 159. Projects for Interview Data Scientist- Job Material CASE 2: Customer Personality Analysis with Python Apriori Algorithm The Apriori algorithm is the simplest technique for identifying the underlying relationships between different types of elements. The idea behind this algorithm is that all non-empty subsets of a frequent itemset must also be frequent. Here I will be using the Apriori algorithm for the task of customer personality analysis with Python, to identify the biggest customers of wines:
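A minimal Apriori sketch using the mlxtend library; the one-hot segment flags below are made-up stand-ins for the age/income/seniority segments defined earlier:

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Toy one-hot table: each row is a customer, each column a segment/behaviour flag
df = pd.DataFrame({
    "HighIncome": [1, 1, 0, 1, 0, 1],
    "OldCustomer": [1, 0, 1, 1, 0, 1],
    "BiggestWineBuyer": [1, 1, 0, 1, 0, 1],
}).astype(bool)

frequent = apriori(df, min_support=0.3, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])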
  • 160. Projects for Interview Data Scientist- Job Material CASE 2: Customer Personality Analysis with Python
  • 161. Projects for Interview Data Scientist- Job Material CASE 2: Customer Personality Analysis with Python
• 162. Projects for Interview Data Scientist- Job Material CASE 3: Fake Currency Detection with Machine Learning Fake currency detection is a real problem for both individuals and businesses. Counterfeiters are constantly finding new methods and techniques to produce counterfeit banknotes, which are essentially indistinguishable from real money, at least to the human eye. In this article, I will introduce you to fake currency detection with machine learning. Fake currency detection is a binary classification task in machine learning. If we have enough data on real and fake banknotes, we can use that data to train a model that can classify new banknotes as real or fake. The dataset I will use for fake currency detection can be downloaded from here. The dataset contains these four input characteristics: • The variance of the wavelet-transformed image • The skewness (asymmetry) of the wavelet-transformed image • The kurtosis of the wavelet-transformed image • The entropy of the image
  • 163. Projects for Interview Data Scientist- Job Material CASE 3: Fake Currency Detection with Machine Learning The target value is simply 0 for real banknotes and 1 for fake banknotes. Now let’s get started with this task of Fake Currency Detection with Machine Learning. I will start this task by importing the necessary packages:
  • 164. Projects for Interview Data Scientist- Job Material CASE 3: Fake Currency Detection with Machine Learning Now, let’s have a look at the dataset, the data does not have headings so I will also assign headings in the process and then I will print the first 5 rows from the data:
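The loading code is image-only in this extract; a minimal sketch, assuming the UCI banknote-authentication file is saved as data_banknote_authentication.txt (it ships without headers):

import pandas as pd

cols = ["variance", "skewness", "kurtosis", "entropy", "class"]
data = pd.read_csv("data_banknote_authentication.txt", header=None, names=cols)
print(data.head())
print(data["class"].value_counts())  # 0 = real banknotes, 1 = fake banknotes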
  • 165. Projects for Interview Data Scientist- Job Material CASE 3: Fake Currency Detection with Machine Learning
• 166. Projects for Interview Data Scientist- Job Material CASE 3: Fake Currency Detection with Machine Learning From this pair plot we can make several interesting observations: • The distributions of both variance and skewness appear to be quite different for the two target classes, while kurtosis and entropy appear to be more similar. • There are clear linear and nonlinear trends in the input features. • Some features seem to be correlated. • Some features seem to separate genuine and fake banknotes quite well.
  • 167. Projects for Interview Data Scientist- Job Material CASE 3: Fake Currency Detection with Machine Learning
• 168. Projects for Interview Data Scientist- Job Material CASE 3: Fake Currency Detection with Machine Learning Data Processing Now we need to balance our data. The easiest way to do this is to randomly drop a number of instances of the overrepresented target class; this is called random undersampling. Alternatively, we could create new synthetic data for the under-represented target class; this is called oversampling. For now, let's start by randomly deleting 152 observations of genuine banknotes: Now we have a perfectly balanced dataset. Next, we need to divide the data into training and test sets:
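A minimal sketch of this step, continuing the loading sketch above; sampling the genuine notes down to the size of the fake class is equivalent to dropping the 152 extra genuine observations:

import pandas as pd
from sklearn.model_selection import train_test_split

cols = ["variance", "skewness", "kurtosis", "entropy", "class"]
data = pd.read_csv("data_banknote_authentication.txt", header=None, names=cols)  # assumed file

# Random undersampling: keep only as many genuine notes (class 0) as there are fakes (class 1)
real = data[data["class"] == 0]
fake = data[data["class"] == 1]
balanced = pd.concat([real.sample(len(fake), random_state=42), fake])

X = balanced.drop("class", axis=1)
y = balanced["class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(y_train.value_counts())
print(y_test.value_counts())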
  • 169. Projects for Interview Data Scientist- Job Material CASE 3: Fake Currency Detection with Machine Learning
  • 170. Projects for Interview Data Scientist- Job Material CASE 3:
  • 171. Projects for Interview Data Scientist- Job Material CASE 3:
  • 172. Projects for Interview Data Scientist- Job Material CASE 3:
  • 173. Projects for Interview Data Scientist- Job Material CASE 3:
• 174. Projects for Interview Data Scientist- Job Material E-commerce Website using Django This project deals with developing a virtual 'E-commerce Website'. It provides the user with a list of the various products available for purchase in the store. For the convenience of online shopping, a shopping cart is provided to the user. After the selection of the goods, the order is sent for the confirmation process. The system is implemented using Python's web framework Django. To develop an e-commerce website, it is necessary to study and understand many technologies. Scope: The scope of the project will be limited to some functions of the e-commerce website. It will display products; customers can browse catalogs, select products, and remove products from their cart, specifying the quantity of each item. Selected items will be collected in a cart. At checkout, the items in the cart will be presented as an order. Customers can pay for the items in the cart to complete an order. This project has great future scope. The project also provides security with the use of a login ID and password, so that no unauthorized user can access an account; only authorized persons with the appropriate access rights can use the software. Technologies used in the project: the Django framework and the SQLite database, which comes by default with Django. Required Skillset to Build the Project: Knowledge of Python and the basics of the Django framework. ER and Use-Case Diagrams
  • 175. Projects for Interview Data Scientist- Job Material E-commerce Website using Django
  Use-Case diagram for Customer Interface:
  • Customer shops for a product
  • Customer changes quantity
  • Customer adds an item to the cart
  • Customer views cart
  • Customer checks out
  • Customer sends order
  Use-Case diagram for Admin Interface:
  • Admin logs in
  • Admin inserts item
  • Admin removes item
  • Admin modifies item
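A minimal data-model sketch for the entities implied by these use cases (model and field names are illustrative assumptions, not the project's actual code):

# store/models.py
from django.db import models
from django.contrib.auth.models import User

class Product(models.Model):
    # A product shown in the store listing
    name = models.CharField(max_length=100)
    price = models.DecimalField(max_digits=8, decimal_places=2)
    description = models.TextField(blank=True)

class Order(models.Model):
    # One checkout by one customer
    customer = models.ForeignKey(User, on_delete=models.CASCADE)
    placed_at = models.DateTimeField(auto_now_add=True)

class OrderItem(models.Model):
    # A product line in an order, with the quantity chosen in the cart
    order = models.ForeignKey(Order, on_delete=models.CASCADE)
    product = models.ForeignKey(Product, on_delete=models.CASCADE)
    quantity = models.PositiveIntegerField(default=1)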
  • 176. Projects for Interview Data Scientist- Job Material E-commerce Website using Django Step by Step Implementation:
  • Create a normal project: Open the IDE and create a normal project by selecting File -> New Project.
  • Install Django: Next, we will install the Django module from the terminal. We will use the PyCharm integrated terminal for this task; one can also use cmd on Windows and install the module by running the command python -m pip install django.
  • Check the installed Django version: Run the command python -m django --version.
  • Create the Django project: Executing the django-admin startproject ProjectName command creates a Django project inside the normal project we already created.
  • Check the Python 3 version: python3 --version
  • Run the default Django web server: Django provides a default development web server where we can launch our applications. Run python manage.py runserver in the terminal. By default, the server runs on port 8000; access it at the URL printed in the terminal.
  • 177. Projects for Interview Data Scientist- Job Material E-commerce Website using Django Open the project folder using a text editor. The directory structure should look like this:
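For reference, a freshly created project typically looks roughly like this (names depend on the project and app names chosen; the store app only appears after running python manage.py startapp store, and its urls.py is added by hand):

ProjectName/
    manage.py
    ProjectName/
        __init__.py
        settings.py
        urls.py
        asgi.py
        wsgi.py
    store/
        __init__.py
        admin.py
        apps.py
        migrations/
        models.py
        tests.py
        views.py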
  • 183. Projects for Interview Data Scientist- Job Material E-commerce Website using Django Views: In the views package we create home.py, login.py, signup.py, cart.py, checkout.py and orders.py, each of which takes a request and renders an HTML template as a response. Create home.html, login.html, signup.html, cart.html, checkout.html and orders.html in the templates folder, and map the views to URL patterns in the store app's urls.py.
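A minimal sketch of one such view and its URL mapping (the app name store, the views package layout, and the function names are assumptions; the other views follow the same pattern):

# store/views/home.py
from django.shortcuts import render

def home(request):
    # Render the product listing page
    return render(request, "home.html")

# store/urls.py
from django.urls import path
from .views import home, cart, checkout

urlpatterns = [
    path("", home.home, name="home"),
    path("cart/", cart.cart, name="cart"),
    path("checkout/", checkout.checkout, name="checkout"),
]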
  • 191. Projects for Interview Data Scientist- Job Material -Tableau HR-Analytics
  ABOUT THE PROJECT: Every company has an HR department which deals with various recruitment and placement related work. In this project, insights are drawn from a large dataset; these can help the HR department in its work and give it a picture of the recruitment landscape in the market.
  ABOUT THE DATASET: The dataset was provided by Ineuron.ai in .sql format. It was scraped by them for the Hong Kong location.
  INSIGHTS TO BE FOUND:
  • What is the total number of jobs available?
  • What is the total number of companies providing jobs?
  • What is the total number of jobs in each domain?
  • What are the various career levels and how are they distributed across jobs?
  • How are jobs distributed across the various analytics fields?
  • Which company is providing the highest number of jobs?
  • Which domain has the highest number of jobs?
  • What are the various job types for different job titles?
  • Which are the top 5 companies with the highest number of jobs?
  • 199. Projects for Interview Data Scientist- Job Material -Tableau HR-Analytics HOW TO RUN THE PROJECT:
  1. First, we need MySQL Workbench installed on the system.
  2. Connect to localhost and import the files that are in .sql format, then check for the imported tables. Note: the results (insights) drawn can also be validated with simple queries from MySQL Workbench.
  3. After that, the data will be in your database. We can then connect Tableau to MySQL by providing the ID and password and proceed.
  4. Final step: once Tableau is connected to MySQL, just open the Tableau file to see the dashboards.
  • 200. Data Scientist- Job Material How to Prepare for Data Science Job Interview Interviews are a unique social situation that many people find extremely stressful. The key to doing well in interviews is to keep calm and to do the proper preparation ahead of time. Make a plan for the logistics: bring multiple copies of your up-to-date resume, arrive 10 to 15 minutes early, and map out your transportation route to the interview location the night before. In the interview itself, remember to keep your answers focused and to the point, directly answering the question being asked. While context is of course important, you also have a limited amount of time in the conversation to create a good first impression, so be mindful not to go off on tangents. Do your research on the company itself, and do your best to understand its goals, culture, and product; be able to describe exactly what the company does in your own words. Understanding the company culture can also help you decide what attire would be appropriate for the interview, whether full formal or office casual. In addition, resources such as Glassdoor can give you insight into the structure of the data science interview, and even examples of questions that have been asked before. Thank You