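Common Interview Questions for Big Data Data Scientists: Interview Questions and Answers About Data Science, Data Understanding and Preparation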
Published: 2019-05-11


Q1: In data science terminology, what do you call the data that you analyze?

In data science, you analyze datasets. A dataset consists of cases, which are the entities you analyze. Cases are described by their variables, which represent the attributes of the entities. The first important question to answer when you start a data science project is what exactly your case is. Is it a person, a family, an order? You then collect all the knowledge you can get about each case and store this information in the variables.

Q2: What data structures do you use to store the datasets?

In SQL Server, you store the dataset you are analyzing in a table. A table can be a physical one or a virtual one, which is a view. SQL Server tables follow the relational model, which is based on set theory, so a table represents a set. Cases are rows, and variables are columns.

R introduces a structure called the data frame. Conceptually, a data frame is a matrix, so its basis is linear algebra rather than set theory.

Python follows R and also offers data frames. However, unlike in R, where the data frame structure is part of the core engine, in Python you need an additional library called pandas to get data frames.

Q3: Is there a difference between a table and a data frame?

Yes, there is an important difference. A SQL Server table represents a set, while a data frame represents a matrix. Because the order of rows and columns in a set is not defined, you cannot refer to the data values by their position. In order to read a value from a table, you need to know the table name, the key of the row, and the name of the column. In an R or Python data frame, you can use numerical positional indexes to read the data values by their position.
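
The two ways of addressing a value can be sketched in pandas, which supports both. This is only an illustration: the data, the row keys, and the column names are made up, and the row labels merely play the role that a key plays in a table.

```python
import pandas as pd

# Hypothetical data: three cases (rows) described by two variables (columns).
df = pd.DataFrame(
    {"age": [34, 51, 27], "city": ["Oslo", "Lima", "Pune"]},
    index=["c1", "c2", "c3"],  # row labels, standing in for a table key
)

# Set-style access: by row key and column name; position is irrelevant.
by_key = df.loc["c2", "city"]

# Matrix-style access: by zero-based row and column position.
by_position = df.iloc[1, 1]

print(by_key, by_position)
```

Both expressions read the same cell; only the second one depends on the order of the rows and columns.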

Q4: Do all variables measure data values in the same way?

No, there are two basic classes of variables:

  • Discrete variables have a limited pool of possible distinct values. Discrete variables can be categorical, ordinal, or dichotomous.
  • Continuous variables have an infinite pool of values. Continuous variables can be open or closed intervals, or true numerical variables (true numerics).

Q5: How do you get an overview of the distribution of a discrete variable?

You should use frequency tables, or frequencies for short. A frequency table can hold many different pieces of information. At a minimum, it must show the distinct values and the count of each value, which is the absolute frequency. You can also show the absolute percentage, the cumulative frequency, and the cumulative percentage. Graphically, you use bar charts and histograms.
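
As a rough sketch, a frequency table with these four measures could be built in pandas as follows. The variable, its values, and the column names are invented for illustration.

```python
import pandas as pd

# Hypothetical discrete variable: education level of ten cases.
education = pd.Series(
    ["High School", "Bachelors", "Bachelors", "Graduate", "High School",
     "Bachelors", "High School", "Graduate", "Bachelors", "High School"],
    name="education",
)

# Absolute frequency of each distinct value, sorted by value for a stable order.
abs_freq = education.value_counts().sort_index()

freq_table = pd.DataFrame({
    "abs_freq": abs_freq,                                  # count per value
    "abs_pct": 100 * abs_freq / abs_freq.sum(),            # share per value
    "cum_freq": abs_freq.cumsum(),                         # running count
    "cum_pct": 100 * abs_freq.cumsum() / abs_freq.sum(),   # running share
})
print(freq_table)
```

Note that the cumulative columns are only meaningful when the values are listed in a sensible order, which matters for ordinal variables.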

Q6: Do you treat ordinal discrete variables differently from categorical ones?

Ordinal variables, or ordinals, have an intrinsic order. You need to define this order correctly to get the appropriate results when you use them for analysis.

Categorical variables are also called nominal, because they provide only the name for each category, and nothing more.

Q7: How do you define the order of ordinals?

There are many ways to define the order of ordinals correctly. In T-SQL, you can use the CASE expression to modify the data values. In R and Python, you have additional possibilities besides modifying the data values. In R, you define all discrete variables as factors, whose distinct values are called levels. For ordinals, you also define the order of the levels. In Python, you use the astype('category') method on a pandas data frame column to define a discrete variable, and then the pandas cat.reorder_categories() method to define the proper order.
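
A minimal sketch of the Python approach, assuming a made-up education variable with three levels:

```python
import pandas as pd

# Hypothetical ordinal variable stored as plain strings.
df = pd.DataFrame(
    {"education": ["Graduate", "High School", "Bachelors", "High School"]}
)

# Step 1: declare the variable as discrete (categorical).
df["education"] = df["education"].astype("category")

# Step 2: define the intrinsic order of the categories.
df["education"] = df["education"].cat.reorder_categories(
    ["High School", "Bachelors", "Graduate"], ordered=True
)

# Sorting and comparisons now follow the defined order, not the alphabet.
sorted_values = df["education"].sort_values().tolist()
lowest = df["education"].min()
print(sorted_values, lowest)
```

Without the second step, the categories would be ordered alphabetically, which puts "Bachelors" before "High School".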

Q8: How do you convert discrete variables to numerics?

If the values are ordinals, you can use the numeric positional index of the categories as the numerical representation of the variable. For nominals, you cannot use a number for a category, because numbers have an intrinsic order. You create a new indicator variable for every possible value of a nominal, showing whether the value is taken or not. Such indicators are called dummies.

Q9: How do you create dummies from nominals?

In SQL Server, you can use the CASE expression again. In addition, since there are only two possibilities for a dummy, 0 or 1, you can also use the T-SQL IIF() function, which is a shortcut for CASE when you have only two possible outcomes.

In R, you can use the dummy() function from the dummies package. In Python, you can use the pandas get_dummies() function.
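
The following sketch shows both conversions in pandas: the positional index of the categories for an ordinal, and get_dummies() for a nominal. The data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical data: one ordinal and one nominal variable.
df = pd.DataFrame({
    "education": pd.Categorical(
        ["Bachelors", "High School", "Graduate"],
        categories=["High School", "Bachelors", "Graduate"],
        ordered=True,
    ),
    "occupation": ["Clerical", "Manual", "Clerical"],
})

# Ordinal: the zero-based position of each category serves as its number.
df["education_int"] = df["education"].cat.codes

# Nominal: one 0/1 indicator column (dummy) per distinct value.
dummies = pd.get_dummies(df["occupation"], prefix="occupation", dtype=int)
df = pd.concat([df, dummies], axis=1)
print(df)
```

The dtype=int argument is a choice made here so that the indicators appear as 0 and 1 rather than as Boolean values.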

Q10: How do you convert numerical variables to discrete ones?

Binning, or discretization, is the process of creating discrete variables from numerics. Note that there are many different ways of binning. Nevertheless, since numerics have an order, you typically want to preserve the order. Thus, you create ordinals from numerics.

Q11: What are the different ways of discretization?

There are many different ways of binning. Each has its own advantages and disadvantages. The most popular ones are:

  • Equal width binning, where the width of each bin (the interval of a continuous variable) is equal
  • Equal height binning, where each bin holds an equal number of cases, but the width of the bins varies
  • Custom binning, where you define the bins based on the content of the data or on the business logic

Q12: What are the advantages of the different ways of binning?

The most important advantages of the different ways of binning are:

  • For equal width binning, you preserve the shape of the distribution
  • For equal height binning, you preserve the information in the variable
  • For custom binning, you follow the logic of real life

Q13: How do you do the discretization?

In T-SQL, you use the CASE expression for equal width and custom binning. You can use the NTILE() window function for equal height binning.

In Python, you use the pandas cut() function for equal width and custom binning. For equal height binning, you can use the qcut() function.
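
A small sketch of the three binning styles with cut() and qcut(). The ages, the number of bins, the custom bin edges, and the labels are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical numeric variable: ages of eight cases.
age = pd.Series([18, 22, 25, 31, 38, 47, 58, 78])

# Equal width: four bins that each span the same range of values.
equal_width = pd.cut(age, bins=4, labels=False)

# Equal height: four bins that each hold the same number of cases.
equal_height = pd.qcut(age, q=4, labels=False)

# Custom: bin edges chosen from business logic (assumed here).
custom = pd.cut(
    age,
    bins=[0, 24, 44, 64, 120],
    labels=["young", "adult", "middle-aged", "senior"],
)

print(pd.DataFrame({
    "age": age,
    "equal_width": equal_width,
    "equal_height": equal_height,
    "custom": custom,
}))
```

With this data, the equal width bins hold 4, 2, 1, and 1 cases, which mirrors the skew of the ages, while the equal height bins hold 2 cases each.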

In R, you can use the cut() function from the base installation for equal width and custom binning. For equal height binning, you can look for a function in an additional package. Alternatively, it is quite simple to write your own function.

Q14: Is there a measure for the amount of information in a variable?

Yes, the measure of information is entropy. Entropy is defined in information theory, a branch of applied mathematics. Basically, information is the same thing as surprise. If you already know a piece of information, you are not surprised when somebody tells it to you or when you read or see it again. Entropy quantifies the amount of information.

Q15: Which variable has a higher entropy, a constant or an equal height binned one?

Entropy is also uncertainty. The more uncertainty, the higher the possible surprise. There is no uncertainty in a constant, a variable that takes only one value for all cases. The entropy of a constant is zero, which is why constants are not useful for analyses. On the other hand, a variable with an equal number of cases in each class has the maximal possible uncertainty for a given number of distinct classes.
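
The comparison can be checked with a few lines of plain Python. This sketch assumes Shannon entropy measured in bits (logarithm base 2); the three sample variables are made up.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy, in bits, of the distinct values in a sequence."""
    counts = Counter(values)
    total = len(values)
    # H = -sum(p * log2(p)) over the observed classes.
    return sum(-(n / total) * log2(n / total) for n in counts.values())

constant = ["a"] * 8                      # one value for all cases
skewed = ["a"] * 5 + ["b", "c", "d"]      # four classes, unequal counts
equal_height = ["a", "b", "c", "d"] * 2   # four classes, equal counts

print(entropy(constant))
print(entropy(skewed))
print(entropy(equal_height))
```

The constant yields zero, and the equal height variable yields 2 bits, which is log2(4), the maximum for four classes. The skewed variable falls in between.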

Q16: Which variable has a higher entropy, a numeric or a discrete one?

Numerical variables can carry more information than discrete ones, although this depends on the distribution. A discrete variable reaches its maximal possible entropy when its distribution is equal height. A continuous variable with a given variance reaches its maximal possible entropy when its distribution is normal. The maximal possible entropy of a discrete variable rises with the number of distinct classes. Therefore, when you discretize a variable, you lose some entropy.

Q17: How do you calculate the entropy of a discrete variable?

In T-SQL, you need to do your own calculation because there is no function out of the box. You need the aggregate functions, window aggregate functions, and the LOG() function.

In Python, you can use the stats.entropy() function from the scipy library. However, you need to calculate counts in advance.
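
A short sketch of that two-step approach: count the values with pandas first, then pass the counts to stats.entropy(). The data is invented, and base=2 is a choice made here to report bits; by default the function uses the natural logarithm.

```python
import pandas as pd
from scipy import stats

# Hypothetical discrete variable.
education = pd.Series(
    ["High School", "Bachelors", "Bachelors", "Graduate",
     "High School", "Bachelors", "High School", "Graduate"]
)

# stats.entropy() expects counts (or probabilities), so count first.
counts = education.value_counts()

# Entropy of the variable in bits; the counts are normalized internally.
h = stats.entropy(counts.to_numpy(), base=2)

# Maximal possible entropy for the same number of distinct classes.
h_max = stats.entropy([1] * len(counts), base=2)

print(h, h_max)
```

Comparing h with h_max shows how close the variable comes to an equal height distribution.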

In R, you can use the Entropy() function from the DescTools package. This function also expects the counts as the inputs.

Q18: Are R and Python data frame positional indexes the same?

No, there is a slight difference between referring to the data in an R data frame and in a Python pandas data frame. The Python index is zero-based, and the R index is one-based. When you define an interval by an index, in Python the interval does not include the upper bound, while in R it does. Therefore, TM[3:6, 1:4] in R returns the same rows and columns as TM.iloc[2:6, 0:4] in Python.
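
The Python side of this comparison can be verified with a small sketch. The TM data frame below is a hypothetical 8 x 5 frame whose cell values spell out their one-based, R-style row and column numbers.

```python
import pandas as pd

# Hypothetical 8 x 5 data frame; each value encodes its one-based position.
TM = pd.DataFrame(
    [[f"r{r}c{c}" for c in range(1, 6)] for r in range(1, 9)]
)

# Zero-based with the upper bound excluded: positions 2..5 and 0..3,
# that is, the third to sixth rows and the first to fourth columns.
subset = TM.iloc[2:6, 0:4]

top_left = subset.iloc[0, 0]
bottom_right = subset.iloc[-1, -1]
print(subset.shape, top_left, bottom_right)
```

The corners of the result are the cells that R would address as row 3, column 1 and row 6, column 4.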

Q19: What do you use in T-SQL to read the data?

In T-SQL, you use, of course, the mighty SELECT statement to read the data. The core elements are:

  • The FROM clause, where you define the source tables and how to join them
  • The SELECT clause, where you define the columns, or the projection, and the computed columns
  • The WHERE clause, where you define the filters for the rows
  • The ORDER BY clause, where you define the order of the rows returned

Q20: What are the basic operations on data frames in R?

In R, here are some basic operations on data frames:

  • Retrieve the values by positions for rows and columns
  • Retrieve the values by row and column names
  • Join two data frames with the merge() function
  • Bind data frames by columns
  • Bind data frames by rows

Q21: What are the basic operations on data frames in Python?

In pandas, there are many functions that help you operate on data frames. The most basic manipulations include:

  • Creating projections by using a subset of columns
  • Retrieving the values by positions for rows and columns
  • Finding the values with a predicate for the rows and columns
  • Joining two data frames with the merge() function
  • Reordering a data frame with the sort_values() method (the older sort() method is no longer available in current pandas versions)
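
These manipulations might look as follows in pandas. The two data frames and their column names are hypothetical, and sort_values() is used for the reordering because DataFrame.sort() does not exist in current pandas versions.

```python
import pandas as pd

# Hypothetical data frames for illustration.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [34, 51, 27, 45],
    "city": ["Oslo", "Lima", "Pune", "Oslo"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 3],
    "amount": [120.0, 80.0, 42.5],
})

# Projection: keep a subset of the columns.
projection = customers[["customer_id", "age"]]

# Values by position: first two rows, first two columns.
first_two = customers.iloc[0:2, 0:2]

# Values by predicate: rows where age > 30, two named columns.
over_30 = customers.loc[customers["age"] > 30, ["customer_id", "city"]]

# Join two data frames on a common column.
joined = pd.merge(customers, orders, on="customer_id", how="inner")

# Reorder the rows by a column, oldest first.
by_age = customers.sort_values(by="age", ascending=False)

print(joined)
print(by_age)
```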

Q22: How do you aggregate data in T-SQL?

If you need to aggregate data in T-SQL, you use the aggregate functions. The basic aggregate functions are available, including some statistical functions; however, the set of aggregate functions is quite limited.

If you need to perform the aggregations in groups, you use the GROUP BY clause. If you need to filter on the aggregated values, you use the HAVING clause. You can create multiple groupings in a single statement with the GROUPING SETS clause.

Q23: How do you aggregate data in R?

In R, you can start with the summary() function from the base installation. It gives you a quick overview of a variable distribution with descriptive statistics. It returns multiple measures. There are many individual aggregates and statistical functions already in the base package. Of course, you get countless more with additional packages. You can use the aggregate() function from the base installation to do the aggregations in groups.

Q24: How do you aggregate data in Python?

In Python, you use the pandas aggregations on the data frames. A pandas data frame has the describe() method, which gives you an overview of a distribution with descriptive statistics similarly to the R summary() function. There are many more pandas data frame methods for calculating individual descriptive statistics values. You can use the groupby() data frame method to calculate the aggregations in groups.
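
A brief sketch of the three steps mentioned above: an overview with describe(), an individual statistic, and aggregations in groups with groupby(). The data and column names are made up.

```python
import pandas as pd

# Hypothetical data: income of five cases in three cities.
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo", "Lima", "Pune"],
    "income": [52000, 31000, 61000, 35000, 28000],
})

# Overview of the distribution: count, mean, std, min, quartiles, max.
overview = df["income"].describe()

# An individual descriptive statistic.
mean_income = df["income"].mean()

# Aggregations in groups, several measures at once.
per_city = df.groupby("city")["income"].agg(["count", "mean", "max"])

print(overview)
print(per_city)
```

The agg() call is one of several ways to request multiple aggregates per group; calling a single method such as mean() on the grouped column works as well.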

Table of contents

Interview questions and answers about data science, data understanding and preparation