
Statistics Basics


1.1 What is Statistics
1.2 Uses
1.3 Distrust of Statistics
1.4 Statistics can be misused
1.5 Types of Statistics
1.6 Common mistakes committed in interpretation of Statistics
1.7 Glossary Of Terms

 

1.1 What is Statistics

The word 'Statistics' is derived from the Latin word 'Status', which means a "political state." Clearly, statistics is closely linked with the administrative affairs of a state, such as facts and figures regarding defense forces, population, housing, food, financial resources etc. What is true of a government is also true of industrial administration units, and even of one's personal life.

The word statistics has several meanings. In the first place, it is a plural noun which describes a collection of numerical data such as employment statistics, accident statistics, population statistics, statistics of births and deaths, of income and expenditure, of exports and imports etc. It is in this sense that the word 'statistics' is used by a layman or a newspaper.

Secondly, the word statistics, as a singular noun, is used to describe a branch of applied mathematics whose purpose is to provide methods of dealing with collections of data and extracting information from them in compact form by tabulating, summarizing and analyzing the numerical data or a set of observations.

The various methods used are termed as statistical methods and the person using them is known as a statistician. A statistician is concerned with the analysis and interpretation of the data and drawing valid worthwhile conclusions from the same.

It is in the second sense that we are writing this guide on statistics.

Lastly, the word statistics is used in a specialized sense. It describes various numerical items which are produced by applying statistics (in the second sense) to statistics (in the first sense). Averages, standard deviation etc. are all statistics in this specialized third sense.

The word 'statistics' in the first sense is defined by Professor Horace Secrist as follows:-

"By statistics we mean aggregate of facts affected to a marked extent by multiplicity of causes, numerically expressed, enumerated or estimated according to reasonable standard of accuracy, collected in a systematic manner for a predetermined purpose and placed in relation to each other."

This definition gives all the characteristics of statistics which are (1) Aggregate of facts (2) Affected by multiplicity of causes (3) Numerically expressed (4) Estimated according to reasonable standards of accuracy (5) Collected in a systematic manner (6) Collected for a predetermined purpose (7) Placed in relation to each other.
The word 'statistics' in the second sense is defined by Croxton and Cowden as follows:-

"The collection, presentation, analysis and interpretation of the numerical data."

This definition clearly points out four stages in a statistical investigation, namely:

1) Collection of data     2) Presentation of data
3) Analysis of data        4) Interpretation of data

 


1.2 Uses

To present the data in a concise and definite form : Statistics helps in classifying and tabulating raw data for processing and further tabulation for end users.

To make it easy to understand complex and large data : This is done by presenting the data in the form of tables, graphs, diagrams etc., or by condensing the data with the help of means, dispersion etc.

For comparison : Tables, measures of means and dispersion can help in comparing different sets of data.

In forming policies : It helps in forming policies like a production schedule, based on the relevant sales figures. It is used in forecasting future demands.

Enlarging individual experience : Complex problems can be well understood by statistics, as the conclusions drawn with its help are more definite and precise than mere statements on facts.

In measuring the magnitude of a phenomenon:- Statistics has made it possible to count the population of a country and to measure industrial growth, agricultural growth and the educational level (of course, in numbers).
 
Limitations

Statistics does not deal with individual measurements. Since statistics deals with aggregates of facts, it cannot be used to study the changes that have taken place in individual cases. For example, the wages earned by a single industry worker at any time, taken by themselves, are not a statistical datum. But the wages of the workers of that industry can be used statistically. Similarly, the marks obtained by John of your class or the height of Beena (also of your class) are not the subject matter of statistical study. But the average marks or the average height of your class has statistical relevance.
Statistics cannot be used to study qualitative phenomena like morality, intelligence, beauty etc., as these cannot be quantified directly. However, it may be possible to analyze such problems statistically by expressing them numerically. For example, we may study the intelligence of boys on the basis of the marks obtained by them in an examination.

Statistical results are true only on an average:- The conclusions obtained statistically are not universal truths. They are true only under certain conditions. This is because statistics as a science is less exact than the natural sciences.

Statistical data, being approximations, are not mathematically exact. Therefore, they can be used only where mathematical accuracy is not essential.

Statistics, being dependent on figures, can be manipulated and therefore can be used only when the authenticity of the figures has been proved beyond doubt.
 
1.3 Distrust Of Statistics

It is often said that "statistics can prove anything." There are three types of lies - lies, damned lies and statistics - wicked in the order of their naming. A Paris banker said, "Statistics is like a miniskirt, it covers up the essentials but gives you the ideas."

Thus by "distrust of statistics" we mean lack of confidence in statistical statements and methods. The following reasons account for such views about statistics.

Figures are convincing, and therefore people easily believe them.

They can be manipulated in such a manner as to establish foregone conclusions.

The wrong representation of even correct figures can mislead a reader. For example, John earned $4000 in 1990-1991 and Jem earned $5000. Reading this, one would form the opinion that Jem is decidedly a better worker than John. However, if we examine the statement carefully, we might reach a different conclusion, as Jem's earning period is unknown to us. Thus, while working with statistics, one should not only avoid outright falsehoods but be alert to detect possible distortions of the truth.


1.4 Statistics Can Be Misused

In one factory which I know, workers were accusing the management of not providing them with proper working conditions. In support they quoted the number of accidents. When I considered the matter more seriously, I found that most of the staff was inexperienced and thus responsible for those accidents. Moreover, many of the accidents were either minor or fake. I compared the working conditions of this factory with those of other factories, and I found the conditions far better in this factory. Thus, by merely noting the number of accidents and the complaints of the workers, I would not dare to say that the working conditions were worse. On the contrary, with proper statistical knowledge and careful observation, I came to the conclusion that the management was right.

Thus the usefulness of the statistics depends to a great extent upon its user. If used properly, by an efficient and unbiased statistician, it will prove to be an efficient tool.

Collection of facts and figures and deriving meaningful information from them is an important process.
Often it is not possible or practical to record observations of all the individuals of the groups, from different areas, which comprise the population. In such a case observations are recorded for only some of the individuals of the population, selected at random. This selection of some individuals, which will be a subset of the individuals in the original group, is called a sample; i.e. instead of an entire population survey, which would be time-consuming, one manages with a 'sample survey', which can be completed in a shorter time.

Note that if a sample is representative of the whole population, any conclusion drawn from a statistical treatment of the sample would hold reasonably good for the population. This will of course, depend on the proper selection of the sample. One of the aims of statistics is to draw inferences about the population by a statistical treatment of samples.


1.5 Types Of Statistics


As mentioned earlier, for a layman or people in general, statistics means numbers - numerical facts, figures or information. The branch of statistics wherein we record and analyze observations for all the individuals of a group or population and draw inferences about the same is called "Descriptive statistics" or "Deductive statistics". On the other hand, if we choose a sample and, by statistical treatment of it, draw inferences about the population, then this branch of statistics is known as Statistical Inference or Inductive Statistics.
  
In our discussion, we are mainly concerned with two ways of representing descriptive statistics : Numerical and Pictorial.
Numerical statistics are numbers. But some numbers are more meaningful, such as the mean, the standard deviation etc.

When numerical data is presented in the form of pictures (diagrams) and graphs, it is called pictorial statistics. Such statistics make confusing and complex data or information easy, simple and straightforward, so that even a layman can understand it without much difficulty.


1.6 Common Mistakes Committed In Interpretation of Statistics


Bias:- Bias means prejudice or preference of the investigator, which creeps in consciously and unconsciously in proving a particular point.
Generalization:- Sometimes, on the basis of the little data available, one may jump to a conclusion, which leads to erroneous results.

Wrong conclusion:- The characteristics of a group if attached to an individual member of that group, may lead us to draw absurd conclusions.

Incomplete classification:- If we fail to give a complete classification, the influence of various factors may not be properly understood.

  • There may be a wrong use of percentages.
  • Technical mistakes may also occur.
  • An inconsistency in definition can even exist.
  • Wrong causal inferences may sometimes be drawn.
  • There may also be a misuse of correlation.

 

 

1.7 Glossary of Terms

Statistics :
 Statistics is the use of data to help the decision maker to reach better decisions.
 
Data :
 It is any group of measurements that interests us. These measurements provide information for the decision maker. (i) Data that reflect non-numerical features or qualities of the experimental units are known as qualitative data. (ii) Data that possess numerical properties are known as quantitative data.
 
Population:
 Any well defined set of objects about which a statistical enquiry is being made is called a population or universe.

The total number of objects (individuals) in a population is known as the size of the population. This may be finite or infinite.

Individual : Each object belonging to a population is called an individual of the population.
 
Sample:
 A finite set of objects drawn from the population with a particular aim, is called a sample. The total number of individuals in a sample is called the sample size.
  
Characteristic:
 The information required from an individual, from a population or from a sample, during the statistical enquiry (survey) is known as a characteristic of the individual. It is either numerical or non-numerical. For example, shoe size is a numerical characteristic, which refers to a quantity, whereas the mother tongue of a person is a non-numerical characteristic, which refers to a quality. Thus we have quantitative and qualitative types of characteristics.
 

Variate:
 A quantitative characteristic of an individual which can be expressed numerically is called a variate or a variable. It may take different values at different times, places or situations.

Attribute:
 A qualitative characteristic of an individual which cannot be expressed numerically is called an attribute, e.g. the mother-tongue of a person, the color of eyes or the color of hair of a person etc.

Discrete variate :
 A variable that is not capable of assuming all the values in a given range is a discrete variate. 

 
Continuous Variate :
 A variate that is capable of assuming all the numerical values in a given range is called a continuous variate. Consider two examples carefully, viz. the number of students of a class and their heights. The two variates differ, in the sense that the number of students present in a class is a number, say between 0 and 50, and always a whole number. It can never be 1.5, 4.33 etc. This type of variate can take only isolated values and is called a discrete variate. On the other hand, heights ranging from 140 cm to 190 cm can take values like 140.7, 145.8, 185.1 etc. Such a variate is a continuous variate.

Introduction of Statistics

 

What do you understand by the word "statistics"? Give its definitions (by a minimum of 4 authors) as explained by various distinguished authors.

Statistics is a branch of applied mathematics concerned with the collection and interpretation of data. It is the science and practice of developing human knowledge through the use of empirical data expressed in quantitative form. It is based on statistical theory, which is a branch of applied mathematics. Within statistical theory, randomness and uncertainty are modeled by probability theory. Because one aim of statistics is to produce the "best" information from available data, some authors consider statistics a branch of decision theory. Statistics is a set of concepts, rules, and procedures that help us to: organize numerical information in the form of tables, graphs, and charts; understand statistical techniques underlying decisions that affect our lives and well-being; and make informed decisions. Statistics is closely related to data management. It is the study of the likelihood and probability of events occurring based on known information, inferred by taking a limited number of samples. It is a part of mathematics that deals with collecting, organizing, and analyzing data.

Statistics has been defined by various people to mean the following:

  • According to HORACE SECRIST, statistics is an aggregate of facts affected to a marked extent by multiplicity of causes, numerically expressed, enumerated or estimated according to reasonable standards of accuracy, collected in a systematic manner for a predetermined purpose and placed in relation to each other.
  • According to CROXTON AND COWDEN, statistics is the science of collection, organization, presentation, analysis and interpretation of numerical data.
  • According to Prof. YA LU CHOU, statistics is a method of decision making in the face of uncertainty on the basis of numerical data and calculated risks.
  • According to Profs. WALLIS AND ROBERTS, statistics is not a body of substantive knowledge but a body of methods for obtaining knowledge.

 
Enumerate some important developments of statistical theory; also explain the merits and limitations of statistics.
Answer: The theory of probability was initially developed by James Bernoulli, Daniel Bernoulli, Laplace and Karl Gauss. According to this theory, probability starts with logic. There is a set of N elements. We can define a subset of n favorable elements, where n is less than or equal to N. Probability is defined as the ratio of the favorable cases to the total cases, calculated as:

p = n / N

The normal curve was discovered by Abraham de Moivre (1667-1754). The normal distribution (also called the Gaussian distribution or the famous "bell-shaped" curve) is the most important and widely used distribution in statistics. The normal distribution is characterized by two parameters, μ and σ², namely the mean and variance of the population having the normal distribution.

Adolphe Jacques Quetelet (1796-1874) discovered the fundamental principle of "the constancy of great numbers", which became the basis of sampling.

Regression was developed by Sir Francis Galton. It evaluates the relationship between one variable (termed the dependent variable) and one or more other variables (termed the independent variables). It is a form of global analysis, as it only produces a single equation for the relationship, thus not allowing any variation across the study area.

Karl Pearson developed the concept of the chi-square goodness of fit test. Sir Ronald Fisher (1890-1962) made a major contribution in the field of experimental design, turning it into a science. Since 1935, "design of experiments" has made rapid progress, making the collection and analysis of statistical data prompter and more economical. Design of experiments is the complete sequence of steps taken ahead of time to ensure that the appropriate data will be obtained, which will permit an objective analysis and will lead to valid inferences regarding the stated problem.
 

MERITS OF STATISTICS

  • Presenting facts in a definite form.
  • Simplifying masses of figures - condensing them into a few significant figures.
  • Facilitating comparison.
  • Helping in formulating and testing hypotheses and developing new theories.
  • Helping in predictions.
  • Helping in the formulation of suitable policies.

LIMITATIONS OF STATISTICS

  • Does not deal with individual measurements.
  • Deals only with quantitative characteristics.
  • Results are true only on an average.
  • It is only one of the methods of studying a problem.
  • Statistics can be misused. It requires skill to use it effectively; otherwise misinterpretation is possible.
  • It is only a tool or a means to an end, not the end itself, and the end has to be intelligently identified using this tool.

Statistics Sets

Define the elementary theory of sets; explain the various methods by giving suitable examples, and narrate the utility of "set theory" in an organization.
A set is a collection of items, objects or elements which are governed by a rule indicating whether an object belongs to the set or not. In the conventional notation of sets,

Alphabets like A, B, C, X, U, S etc. are used to denote sets. Braces '{ }' are used as a notation for the collection of objects or elements in the set. The Greek letter epsilon '∈' is used to denote 'belongs to'. A vertical line '|' is used to denote the expression 'such that'. The alphabet 'I' is used to denote an 'integer'. Using the above notation, a set called A, consisting of the elements 0, 1, 2, 3, 4, 5, may be mathematically denoted in any of the following ways:


1) List or roster method
         This means all elements are actually listed.

        A = {0, 1, 2, 3, 4, 5}, read as: A is a set with elements 0, 1, 2, 3, 4, 5

2) Set builder or rule method (in which a mathematical rule, equality or inequality etc. is specified to generate the elements of the
     intended set)

        A = {x | 0 ≤ x ≤ 5, x ∈ I}

        Read as: A is (a set of) (variable x) (such that) (x lies between 0 and 5, both inclusive), where the variable x belongs to the integers.

3) Universal set

        A universal set is a set consisting of all objects or elements of a given type or interest, and is normally depicted by the alphabets X, U or S.

        E.g. the set of all digits may be expressed as X = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

4) Finite set is one in which the number of elements can be counted.

        E.g. A = {1, 2, 3…15} having 15 elements

        Or a set of employees in an organization.

5) Infinite set is one in which the number of elements cannot be counted.

        E.g. a set of integers or real numbers.

6) Subset: A is called a subset of B if every element of A is also an element of B.

        This is represented as A ⊆ B and is read as "A is a subset of B".

        E.g. each of the sets A = {0, 1, 2, 3, 4, 5} and B = {1, 3, 5} is a subset of the set C,

        where C = {0, 1, 2, 3, 4, 5}

7) Superset: A is a superset of B if every element of B is an element of A.

        This is represented as A ⊇ B and is read as "A is a superset of B".

8) Equal sets: if A is a subset of B and B is a subset of A, then A and B are called equal sets. This can be denoted as follows:

        If A ⊆ B and B ⊆ A, then A = B
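These relations can be checked directly in Python, whose built-in set type uses <= and >= for ⊆ and ⊇; a minimal sketch with the sets from the examples above:

    A = {0, 1, 2, 3, 4, 5}
    B = {1, 3, 5}
    C = {0, 1, 2, 3, 4, 5}

    print(A <= C)              # True: A is a subset of C
    print(B <= C)              # True: B is a subset of C
    print(C >= B)              # True: C is a superset of B
    print(A <= C and C <= A)   # True, so by (8) the sets are equal
    print(A == C)              # True: Python confirms A = C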

UTILITY OF SET THEORY IN BUSINESS ORGANISATION.

A company consists of sets of resources like personnel, machines, material stocks and cash reserves. The relationships between these sets, and between the subsets of each set, are used to equate assets of one kind with another. The subset of highly skilled production workers, within the set of all production workers, is a critical subset that determines the productivity of the other personnel. Certain subsets of company products are highly profitable, and certain materials may be subject to deterioration and must be stocked in greater quantities than others. Thus the concept of sets is very useful in business management.
Explain the meaning and types of data as applicable in any business. How would you classify and tabulate the data? Support your answer with examples.
Data is any group of observations or measurements related to the area of a business interest and to be used for decision making. It can be of the following two types.

1) Qualitative data (representing non numeric feature or qualities of object under reference)

2) Quantitative data (that represent properties of object under reference with numeric details.) Data can also be of the following types:

  • Primary data (observed and recorded as part of an original experiment or survey)
  • Secondary data (compiled by someone other than the user of the data, for decision-making purposes)

Classification and tabulation of data:

Data can be classified by geographical area; chronological sequence; or qualitative attributes like urban or rural, male or female, literate or illiterate, undergraduate, graduate or postgraduate, employed or unemployed, and so on; while the most frequently used method of classification of data is quantitative classification.

Tabulation:

After data is classified, it is represented in tabular form. A self-explanatory and comprehensive table has a table number, a title, captions (column or sub-column headings), stubs, and a body containing the main data, which occupies the cells of the table after the data has been classified under the various captions and stubs. Headnotes are added at the top of the table for general information regarding the relevance of the table or for cross-references or links with other literature. Footnotes are appended for clarification, explanation or additional comments on any of the cells in the table.
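As a small illustration of quantitative classification and tabulation, the following Python sketch (the marks data are hypothetical) classifies raw observations into class intervals of width 10 and tabulates the resulting frequencies:

    from collections import Counter

    # Hypothetical raw data: marks of 20 students
    marks = [12, 45, 37, 28, 55, 41, 33, 18, 47, 52,
             39, 22, 31, 44, 58, 27, 35, 49, 15, 40]

    # Classify each mark into the interval 0-9, 10-19, 20-29, ...
    frequencies = Counter((m // 10) * 10 for m in marks)

    print("Class     Frequency")
    for lower in sorted(frequencies):
        print(f"{lower}-{lower + 9}     {frequencies[lower]}")

A real table would then add the title, captions, stubs, headnotes and footnotes described above.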

Statistics Mean and its types

Describe the arithmetic, geometric and harmonic means with suitable examples. Explain the merits and limitations of the geometric mean.
Arithmetic mean

The arithmetic mean is the "standard" average, often simply called the "mean". It is used for many purposes but is also often abused by incorrectly using it to describe skewed distributions, with highly misleading results. The classic example is average income - using the arithmetic mean makes it appear to be much higher than is in fact the case. Consider the scores {1, 2, 2, 2, 3, 9}. The arithmetic mean is about 3.17, but five out of six scores are below this!

The arithmetic mean of a set of numbers is the sum of all the members of the set divided by the number of items in the set. (The word set is used perhaps somewhat loosely; for example, the number 3.8 could occur more than once in such a "set".) The arithmetic mean is what pupils are taught very early to call the "average." If the set is a statistical population, then we speak of the population mean. If the set is a statistical sample, we call the resulting statistic a sample mean. The mean may be conceived of as an estimate of the median. When the mean is not an accurate estimate of the median, the set of numbers, or frequency distribution, is said to be skewed.

We denote the set of data by X = {x1, x2, ..., xn}. The symbol µ (Greek: mu) is used to denote the arithmetic mean of a population. We use the name of the variable, X, with a horizontal bar over it ("X bar") as the symbol for a sample mean. Both are computed in the same way:

mean = (x1 + x2 + ... + xn) / n

The arithmetic mean is greatly influenced by outliers. In certain situations, the arithmetic mean is the wrong concept of "average" altogether. For example, if a stock rose 10% in the first year, 30% in the second year and fell 10% in the third year, then it would be incorrect to report its "average" increase per year over this three year period as the arithmetic mean (10% + 30% + (-10%))/3 = 10%; the correct average in this case is the geometric mean which yields an average increase per year of only 8.8%.
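A quick Python check of this stock example shows why the two "averages" disagree (the yearly growth factors are taken from the text):

    # Growth factors for +10%, +30% and -10% years
    factors = [1.10, 1.30, 0.90]

    arithmetic = sum(factors) / len(factors)   # 1.10, i.e. "10% per year"
    geometric = (factors[0] * factors[1] * factors[2]) ** (1 / 3)

    print(round(arithmetic - 1, 4))   # 0.1    : arithmetic mean of the returns
    print(round(geometric - 1, 4))    # 0.0877 : geometric mean, about 8.8% per year

    # Only the geometric mean reproduces the actual three-year growth:
    print(round(geometric ** 3, 4))   # 1.287 = 1.10 * 1.30 * 0.90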


Geometric mean
The geometric mean is an average which is useful for sets of numbers which are interpreted according to their product and not their sum (as is the case with the arithmetic mean) - for example, rates of growth.

The geometric mean of a set of positive data is defined as the product of all the members of the set, raised to a power equal to the reciprocal of the number of members. In a formula: the geometric mean of a1, a2, ..., an is (a1 × a2 × ... × an)^(1/n). The geometric mean is useful for determining "average factors". For example, if a stock rose 10% in the first year, 20% in the second year and fell 15% in the third year, then we compute the geometric mean of the factors 1.10, 1.20 and 0.85 as (1.10 × 1.20 × 0.85)^(1/3) = 1.0391... and we conclude that the stock rose on average 3.91 percent per year.

The geometric mean of a data set is always smaller than or equal to the set's arithmetic mean (the two means are equal if and only if all members of the data set are equal). This allows the definition of the arithmetic-geometric mean, a mixture of the two which always lies in between. The geometric mean is also the arithmetic-harmonic mean in the sense that if two sequences (an) and (hn) are defined by a(n+1) = (an + hn)/2 and h(n+1) = 2·an·hn / (an + hn), starting from a0 = x and h0 = y, then both sequences converge to the geometric mean of x and y.


Harmonic mean
The harmonic mean is an average which is useful for sets of numbers which are defined in relation to some unit, for example speed (distance per unit of time).

  • In mathematics, the harmonic mean is one of several methods of calculating an average.
  • The harmonic mean of the positive real numbers a1, ..., an is defined to be

H = n / (1/a1 + 1/a2 + ... + 1/an)

The harmonic mean is never larger than the geometric mean or the arithmetic mean (see generalized mean). In certain situations, the harmonic mean provides the correct notion of "average". For instance, if for half the distance of a trip you travel at 40 miles per hour and for the other half of the distance you travel at 60 miles per hour, then your average speed for the trip is given by the harmonic mean of 40 and 60, which is 48; that is, the total amount of time for the trip is the same as if you traveled the entire trip at 48 miles per hour. Similarly, if in an electrical circuit you have two resistors connected in parallel, one with 40 ohms and the other with 60 ohms, then the average resistance of the two resistors is 48 ohms; that is, the total resistance of the circuit is the same as it would be if each of the two resistors were replaced by a 48-ohm resistor. (Note: this is not to be confused with their equivalent resistance, 24 ohm, which is the resistance needed for a single resistor to replace the two resistors at once.) Typically, the harmonic mean is appropriate for situations when the average of rates is desired.

Another formula for the harmonic mean of two numbers is to multiply the two numbers and divide that quantity by the arithmetic mean of the two numbers. In mathematical terms:

H = xy / ((x + y)/2) = 2xy / (x + y)
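Python's statistics module implements all three means (geometric_mean requires Python 3.8 or later); a quick check of the trip example above:

    import statistics

    speeds = [40, 60]   # miles per hour over two equal distances

    print(statistics.harmonic_mean(speeds))    # 48.0, the correct average speed
    print(statistics.mean(speeds))             # 50.0, misleading for rates
    print(statistics.geometric_mean(speeds))   # about 48.99, between the other two

    # Two-number shortcut: the product divided by the arithmetic mean
    x, y = 40, 60
    print((x * y) / ((x + y) / 2))             # 48.0 again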
Merits and limitations of geometric mean


Merits:

  • It is based on each and every item of the series.
  • It is rigidly defined
  • It is useful in averaging ratios and percentages and in determining rates of increase or decrease.
  • It gives less weight to large items and more to small items. Thus the geometric mean of a set of values is never greater than their arithmetic mean.
  • It is capable of algebraic manipulation, like computing the grand geometric mean of the geometric means of different sets of values.

Limitations:

  • It is relatively difficult to comprehend, compute and interpret.
  • A geometric mean cannot be computed when any value is zero, and values with a negative sign cause similar difficulties.

Statistics Central Tendency

Explain the various measures of central tendency.
In statistics, central tendency is the general level, characteristic, or typical value that is representative of the majority of cases. Among the several accepted measures of central tendency employed in data reduction, the most common are the arithmetic mean (simple average), the median, and the mode. For example, one measure of central tendency of a group of high school students is the average (mean) age of the students. Central tendency is a term used in some fields of empirical research to refer to what statisticians sometimes call "location". A "measure of central tendency" is either a location parameter or a statistic used to estimate a location parameter. Examples include:

  • Arithmetic mean, the sum of all data divided by the number of observations in the data set.
  • Median, the value that separates the higher half from the lower half of the data set.
  • Mode, the most frequent value in the data set.

Measures of central tendency, or "location", attempt to quantify what we mean by the "typical" or "average" score in a data set. The concept is extremely important and we encounter it frequently in daily life. For example, we often want to know, before purchasing a car, its average distance per litre of petrol. Or before accepting a job, you might want to know what a typical salary is for people in that position, so you will know whether or not you are going to be paid what you are worth. Or, if you are a smoker, you might often think about how many cigarettes you smoke "on average" per day. Statistics geared toward measuring central tendency all focus on this concept of "typical" or "average." As we will see, we often ask questions in psychological science revolving around how groups differ from each other "on average".

Answers to such questions tell us a lot about the phenomenon or process we are studying.

Arithmetic Mean

The arithmetic mean is the most common measure of central tendency. It is simply the sum of the numbers divided by the number of numbers. The symbol µ is used for the mean of a population. The symbol M is used for the mean of a sample. The formula for µ is shown below:

µ = ΣX / N

where ΣX is the sum of all the numbers in the sample and N is the number of numbers in the sample. As an example, the mean of the numbers 1, 2, 3, 6, 8 is (1 + 2 + 3 + 6 + 8)/5 = 20/5 = 4, regardless of whether the numbers constitute the entire population or just a sample from the population. The table, Number of touchdown passes, shows the number of touchdown (TD) passes thrown by each of the 31 teams in the National Football League in the 2000 season. The mean number of touchdown passes thrown is 20.4516, as shown below:

µ = ΣX / N = 634/31 = 20.4516

Number of touchdown passes

37  33  33  32  29  28  28  23
22  22  22  21  21  21  20  20
19  19  18  18  18  18  16  15
14  14  14  12  12  9  6  


Although the arithmetic mean is not the only "mean" (there is also a geometric mean), it is by far the most commonly used. Therefore, if the term "mean" is used without specifying whether it is the arithmetic mean, the geometric mean, or some other mean, it is assumed to refer to the arithmetic mean.

Median

The median is also a frequently used measure of central tendency. The median is the midpoint of a distribution: the same number of scores are above the median as below it. For the data in the table, Number of touchdown passes, there are 31 scores. The 16th highest score (which equals 20) is the median because there are 15 scores below the 16th score and 15 scores above the 16th score. The median can also be thought of as the 50th percentile. Let's return to the made-up example of the quiz on which you made a three, discussed previously in the module Introduction to Central Tendency and shown in the table, Three possible datasets for the 5-point make-up quiz.

Student  Dataset 1  Dataset 2  Dataset 3
You  3  3  3
John's  3  4  2
Maria's  3  4  2
Shareecia's  3  4  2
Luther's  3  5  1

For Dataset 1, the median is three, the same as your score. For Dataset 2, the median is 4. Therefore, your score is below the median. This means you are in the lower half of the class. Finally, for Dataset 3, the median is 2. For this dataset, your score is above the median and therefore in the upper half of the distribution.

Computation of the Median: When there is an odd number of numbers, the median is simply the middle number. For example, the median of 2, 4, and 7 is 4. When there is an even number of numbers, the median is the mean of the two middle numbers. Thus, the median of the numbers 2, 4, 7, 12 is (4 + 7)/2 = 5.5.

Mode

The mode is the most frequently occurring value. For the data in the table, Number of touchdown passes, the mode is 18, since more teams (4) had 18 touchdown passes than any other number of touchdown passes. With continuous data, such as response time measured to many decimals, the frequency of each value is one since no two scores will be exactly the same (see discussion of continuous variables). Therefore the mode of continuous data is normally computed from a grouped frequency distribution. The Grouped frequency distribution table shows a grouped frequency distribution for the target response time data. Since the interval with the highest frequency is 600-700, the mode is the middle of that interval (650).

Grouped frequency distribution

Range  Frequency
500-600  3
600-700  6
700-800  5
800-900  5
900-1000  0
1000-1100  1
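The mean, median and mode quoted above can be verified with Python's statistics module, using the touchdown-pass data from the table:

    import statistics

    # Touchdown passes of the 31 NFL teams in the 2000 season (table above)
    td = [37, 33, 33, 32, 29, 28, 28, 23,
          22, 22, 22, 21, 21, 21, 20, 20,
          19, 19, 18, 18, 18, 18, 16, 15,
          14, 14, 14, 12, 12, 9, 6]

    print(round(statistics.mean(td), 4))   # 20.4516 (= 634/31)
    print(statistics.median(td))           # 20, the 16th of the 31 ordered scores
    print(statistics.mode(td))             # 18, which occurs four times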
 

Tri-mean

The tri-mean is computed by adding the 25th percentile plus twice the 50th percentile plus the 75th percentile and dividing by four. What follows is an example of how to compute the tri-mean. The 25th, 50th, and 75th percentiles of the dataset "Example 1" are 51, 55, and 63 respectively. Therefore, the tri-mean is computed as:

tri-mean = (51 + 2 × 55 + 63) / 4 = 224/4 = 56

The tri-mean is almost as resistant to extreme scores as the median and is less subject to sampling fluctuations than the arithmetic mean in extremely skewed distributions. It is less efficient than the mean for normal distributions. The tri-mean is a good measure of central tendency and is probably not used as much as it should be.
Trimmed Mean
A trimmed mean is calculated by discarding a certain percentage of the lowest and the highest scores and then computing the mean of the remaining scores. For example, a mean trimmed 50% is computed by discarding the lower and higher 25% of the scores and taking the mean of the remaining scores. The median is the mean trimmed 100% and the arithmetic mean is the mean trimmed 0%. A trimmed mean is obviously less susceptible to the effects of extreme scores than is the arithmetic mean. It is therefore less susceptible to sampling fluctuation than the mean for extremely skewed distributions. It is less efficient than the mean for normal distributions. Trimmed means are often used in Olympic scoring to minimize the effects of extreme ratings possibly caused by biased judges.
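Neither the tri-mean nor the trimmed mean is in Python's standard library, so here is a minimal sketch of a trimmed mean, assuming the stated trim percentage is split equally between the two tails (and rounded down to whole scores):

    import statistics

    def trimmed_mean(scores, trim_percent):
        """Mean after dropping trim_percent/2 of the scores from each tail."""
        s = sorted(scores)
        k = int(len(s) * trim_percent / 200)   # scores to drop per tail
        return statistics.mean(s[k:len(s) - k] if k else s)

    data = [1, 2, 2, 2, 3, 9]        # the skewed scores used earlier
    print(statistics.mean(data))     # about 3.17, pulled up by the outlier 9
    print(trimmed_mean(data, 50))    # 2.25, after dropping the 1 and the 9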

Statistics Dispersion

What are the various measures of dispersion? Explain each of them.
Answer: In many ways, measures of central tendency are less useful in statistical analysis than measures of dispersion of values around the central tendency. The dispersion of values within variables is especially important in social and political research because:

  • Dispersion or "variation" in observations is what we seek to explain.
  • Researchers want to know WHY some cases lie above average and others below average for a given variable:
  • TURNOUT in voting: why do some states show higher rates than others?
  • CRIMES in cities: why are there differences in crime rates?
  • CIVIL STRIFE among countries: what accounts for differing amounts?
  • Much of statistical explanation aims at explaining DIFFERENCES in observations -- also known as VARIATION, or by the more technical term, VARIANCE.


If everything were the same, we would have no need of statistics. But people's heights, ages, etc., do vary. We often need to measure the extent to which scores in a dataset differ from each other. Such a measure is called the dispersion of a distribution. Some measures of dispersion are:

1) Range

The range is the simplest measure of dispersion. The range can be thought of in two ways:

1. As a quantity: the difference between the highest and lowest scores in a distribution. "The range of scores on the exam was 32."

2. As an interval: the lowest and highest scores may be reported as the range. "The range was 62 to 94," which would be written (62, 94).
The Range of a Distribution
Find the range in the following set of data:
NUMBER OF BROTHERS AND SISTERS

 
{ 2, 3, 1, 1, 0, 5, 3, 1, 2, 7, 4, 0, 2, 1, 2,
1, 6, 3, 2, 0, 0, 7, 4, 2, 1, 1, 2, 1, 3, 5, 12,
4, 2, 0, 5, 3, 0, 2, 2, 1, 1, 8, 2, 1, 2 }
An outlier is an extreme score, i.e., an infrequently occurring score at either tail of the distribution. Range is determined by the furthest outliers at either end of the distribution. Range is of limited use as a measure of dispersion, because it reflects information about extreme values but not necessarily about "typical" values. Only when the range is "narrow" (meaning that there are no outliers) does it tell us about typical values in the data.
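A minimal Python sketch computing the range of the brothers-and-sisters data above, both as a quantity and as an interval:

    data = [2, 3, 1, 1, 0, 5, 3, 1, 2, 7, 4, 0, 2, 1, 2,
            1, 6, 3, 2, 0, 0, 7, 4, 2, 1, 1, 2, 1, 3, 5, 12,
            4, 2, 0, 5, 3, 0, 2, 2, 1, 1, 8, 2, 1, 2]

    print(max(data) - min(data))    # 12, the range as a quantity
    print((min(data), max(data)))   # (0, 12), the range as an interval

    # The single outlier 12 drives the range; without it the range drops to 8:
    without_outlier = [x for x in data if x != 12]
    print(max(without_outlier) - min(without_outlier))   # 8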

2) Percentile range

Most students are familiar with the grading scale in which "C" is assigned to average scores, "B" to above-average scores, and so forth. When grading exams "on a curve," instructors look to see how a particular score compares to the other scores. The letter grade given to an exam score is determined not by its relationship to just the high and low scores, but by its relative position among all the scores. Percentile describes the relative location of points anywhere along the range of a distribution. A score that is at a certain percentile falls even with or above that percent of scores. The median score of a distribution is at the 50th percentile: It is the score at which 50% of other scores are below (or equal) and 50% are above. Commonly used percentile measures are named in terms of how they divide distributions. Quartiles divide scores into fourths, so that a score falling in the first quartile lies within the lowest 25% of scores, while a score in the fourth quartile is higher than at least 75% of the scores.

The divisions just performed illustrate quartile scores. Two other percentile scores commonly used to describe the dispersion in a distribution are decile and quintile scores, which divide cases into equal-sized subsets of tenths (10%) and fifths (20%), respectively. In theory, percentile scores divide a distribution into 100 equal-sized groups. In practice this may not be possible because the number of cases may be under 100.

A box plot is an effective visual representation of both central tendency and dispersion. It simultaneously shows the 25th, 50th (median), and 75th percentile scores, along with the minimum and maximum scores. The "box" of the box plot shows the middle or "most typical" 50% of the values, while the "whiskers" of the box plot show the more extreme values. The length of the whiskers indicates visually how extreme the outliers are. The boundaries of the box plot's "box" line up with the quartile scores, and the box plot displays the median score and shows the range of the distribution as well.
By far the most commonly used measures of dispersion in the social sciences are variance and standard deviation.

3) Variance

Variance is the average squared difference of scores from the mean score of a distribution. Standard deviation is the square root of the variance. In calculating the variance of data points, we square the difference between each point and the mean because if we summed the differences directly, the result would always be zero. For example, suppose three friends work on campus and earn $5.50, $7.50, and $8 per hour, respectively. The mean of these values is (5.50 + 7.50 + 8)/3 = $7 per hour. If we summed the differences of each wage from the mean, we would get (5.50 - 7) + (7.50 - 7) + (8 - 7) = -1.50 + .50 + 1 = 0. Instead, we square the terms to obtain a sum of squared differences equal to 2.25 + .25 + 1 = 3.50; dividing by the number of scores gives the variance, 3.50/3 ≈ 1.17. This figure is a measure of dispersion in the set of scores. The mean yields the minimum sum of squared differences: if we used any number other than the mean as the value from which each score is subtracted, the resulting sum of squared differences would be greater. (You can try it yourself -- see if any number other than 7 can be plugged into the preceding calculation and yield a sum of squared differences less than 3.50.) The standard deviation is simply the square root of the variance. In some sense, taking the square root of the variance "undoes" the squaring of the differences that we did when we calculated the variance. Variance and standard deviation of a population are designated by σ² and σ, respectively. Variance and standard deviation of a sample are designated by s² and s, respectively.
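A Python check of the wage example; note that the population variance divides the 3.50 sum of squares by the number of scores:

    import statistics

    wages = [5.50, 7.50, 8.00]
    mean = statistics.mean(wages)                  # 7.0

    sum_sq = sum((w - mean) ** 2 for w in wages)   # 2.25 + 0.25 + 1
    print(sum_sq)                                  # 3.5

    print(statistics.pvariance(wages))             # 3.5 / 3, about 1.17
    print(statistics.pstdev(wages))                # its square root, about 1.08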

4) Standard Deviation

The standard deviation (σ or s) and variance (σ² or s²) are more complete measures of dispersion which take into account every score in a distribution. The other measures of dispersion we have discussed are based on considerably less information. However, because variance relies on the squared differences of scores from the mean, a single outlier has greater impact on the size of the variance than does a single score near the mean. Some statisticians view this property as a shortcoming of variance as a measure of dispersion, especially when there is reason to doubt the reliability of some of the extreme scores. For example, a researcher might believe that a person who reports watching television an average of 24 hours per day may have misunderstood the question. Just one such extreme score might result in an appreciably larger standard deviation, especially if the sample is small. Fortunately, since all scores are used in the calculation of variance, the many non-extreme scores (those closer to the mean) will tend to offset the misleading impact of any extreme scores.

The standard deviation and variance are the most commonly used measures of dispersion in the social sciences because:
• Both take into account the precise difference between each score and the mean. Consequently, these measures are based on a maximum amount of information.
• The standard deviation is the baseline for defining the concept of standardized score or "z-score".
• Variance in a set of scores on some dependent variable is a baseline for measuring the correlation between two or more variables (the degree to which they are related).

Statistics Z-Scores

Standardized Distribution Scores, or "Z-Scores"
Actual scores from a distribution are commonly known as "raw scores." These are expressed in terms of empirical units like dollars, years, tons, etc. We might say "The Smith family's income is $29,418." To compare a raw score to the mean, we might say something like "The mean household income in the U.S. is $2,232 above the Smith family's income." This difference is an absolute deviation of 2,232 empirical units (dollars, in this example) from the mean. When we are given an absolute deviation from the mean, expressed in terms of empirical units, it is difficult to tell if the difference is "large" or "small" compared to other members of the data set. In the above example, are there many families that make less money than the Smith family, or only a few? We were not given enough information to decide. We get more information about deviation from the mean when we use the standard deviation measure presented earlier in this tutorial. Raw scores expressed in empirical units can be converted to "standardized" scores, called z-scores. The z-score is a measure of how many units of standard deviation the raw score is from the mean. Thus, the z-score is a relative measure instead of an absolute measure, because every individual in the dataset affects the value of the standard deviation. Raw scores are converted to standardized z-scores by the following equations:

Population z-score: z = (x - µ) / σ

Sample z-score: z = (x - M) / s

where µ is the population mean, M is the sample mean, σ is the population standard deviation, s is the sample standard deviation, and x is the raw score being converted. For example, if the mean of a sample of I.Q. scores is 100 and the standard deviation is 15, then an I.Q. of 128 would correspond to:

z = (128 - 100) / 15 = 1.87

For the same distribution, a score of 90 would correspond to:

z = (90 - 100) / 15 = -0.67

A positive z-score indicates that the corresponding raw score is above the mean. A negative z-score represents a raw score that is below the mean. A raw score equal to the mean has a z-score of zero (it is zero standard deviations away). Z-scores allow for comparison across different units of measure. For example, an income that is 25,000 units above the mean might sound very high for someone accustomed to thinking in terms of U.S. dollars, but if the unit is much smaller (such as Italian lire or Greek drachmas), the raw score might be only slightly above average. Z-scores provide a standardized description of departures from the mean that controls for differences in the size of empirical units. When a dataset conforms to a "normal" distribution, each z-score corresponds exactly to a known, specific percentile score. If a researcher can assume that a given empirical distribution approximates the normal distribution, then he or she can assume that the data's z-scores approximate the z-scores of the normal distribution as well.

In this case, z-scores can map the raw scores to their percentile scores in the data. As an example, suppose the mean of a set of incomes is $60,200, the standard deviation is $5,500, and the distribution of the data values approximates the normal distribution. Then an income of $69,275 has a z-score of 1.65. For a normal distribution, a z-score of 1.65 always corresponds to the 95th percentile. Thus, we can assume that $69,275 is the 95th percentile score in the empirical data, meaning that 95% of the scores lie at or below $69,275. The normal distribution is a precisely defined, theoretical distribution. Empirical distributions are not likely to conform perfectly to the normal distribution. If the data distribution is unlike the normal distribution, then z-scores do not translate to percentiles in the "normal" way. However, to the extent that an empirical distribution approximates the normal distribution, z-scores do translate to percentiles in a reliable way.
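Both the I.Q. z-scores above and the income-to-percentile mapping can be checked with Python's statistics.NormalDist (Python 3.8 or later):

    from statistics import NormalDist

    def z_score(x, mean, sd):
        return (x - mean) / sd

    print(round(z_score(128, 100, 15), 2))   # 1.87
    print(round(z_score(90, 100, 15), 2))    # -0.67

    # Under a normal distribution, a z-score maps to a percentile:
    z = z_score(69275, 60200, 5500)
    print(round(z, 2))                       # 1.65
    print(round(NormalDist().cdf(z), 2))     # 0.95, i.e. the 95th percentile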

Statistics Probability

 

What do you understand by the concept of probability? Explain the various theories of probability.
Probability is a branch of mathematics that measures the likelihood that an event will occur. Probabilities are expressed as numbers between 0 and 1. The probability of an impossible event is 0, while an event that is certain to occur has a probability of 1. Probability provides a quantitative description of the likely occurrence of a particular event. Probability is conventionally expressed on a scale of zero to one. A rare event has a probability close to zero. A very common event has a probability close to one.


Four theories of probability
1) Classical or a priori probability: this is the oldest concept; it evolved in the 17th century and is based on the assumption that the outcomes of random experiments (like tossing a coin, drawing cards from a pack or throwing a die) are equally likely. For this reason it is not valid in the following cases:

  • Where the outcomes of experiments are not equally likely, for example the lives of different makes of bulbs.
  • Where the quality of products comes from a mechanical plant operated under different conditions. However, despite these demerits, it is possible to mathematically work out the probability of complex events. A priori probabilities are of considerable importance in applied statistics.

2) Empirical concept: this was developed in the 19th century for insurance business data and is based on the concept of relative frequency; historical data are used for future prediction. The classical approach says that when we toss a coin, the probability of a head coming up is 1/2 because there are two equally likely events, namely the appearance of a head or that of a tail; that is a probability determined from deductive logic. The empirical approach instead estimates this probability from the relative frequency of heads in a long series of actual tosses.
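The relative-frequency idea is easy to see in a simulation; a sketch assuming a fair coin:

    import random

    def relative_frequency_of_heads(tosses):
        heads = sum(random.random() < 0.5 for _ in range(tosses))
        return heads / tosses

    random.seed(1)   # fixed seed so the run is repeatable
    for n in (10, 100, 10_000):
        print(n, relative_frequency_of_heads(n))

    # As the number of tosses grows, the observed relative frequency
    # settles near the a priori value of 1/2.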

3) Subjective or personal approach: this approach was adopted by Frank Ramsey in 1926 and developed by others. It is based on the personal beliefs of the person making the probability statement, drawing on past information, noticeable trends and an appreciation of the future situation. Experienced people use this approach for decision making in their own fields.

4) Axiomatic approach: this approach was introduced by the Russian mathematician A. N. Kolmogorov in 1933. He treats probability as a set function; no precise definition is given, but the following axioms or postulates are adopted.

  • The probability of an event ranges from 0 to 1. That is, an event that surely cannot happen has probability 0, and an event that is sure to happen has probability 1.
  • The probability of the entire sample space (that is, any, some or all of the possible outcomes of an experiment) is 1. Mathematically, P(S) = 1.

What is the "chi-square" test? Narrate the steps for determining the value of χ² with suitable examples. Explain the conditions for applying χ² and the uses of the chi-square test.
This test was developed by Karl Pearson (1857-1936), professor of applied mathematics, London, whose concept of the coefficient of correlation is most widely used. The test considers the magnitude of the discrepancy between theory and observation and is defined as

χ² = Σ (Oi - Ei)² / Ei

where Oi is the observed frequency and Ei is the expected frequency.

Steps for determining the value of χ²

1) When data is given in tabulated form, calculate the expected frequency for each cell using the following formula:

E = (row total) × (column total) / total number of observations

2) Take the difference between O and E for each cell and calculate its square, (O - E)².

3) Divide (O - E)² by the respective expected frequency and total up to get χ².

4) Compare the calculated value with the table value at the given degrees of freedom and the specified level of significance. If, at the stated level, the calculated value is more than the table value, the difference between the theoretical and observed frequencies is considered significant; it could not have arisen due to fluctuations of simple sampling. However, if the value is less than the table value, it is not considered significant; it is regarded as due to fluctuations of simple sampling and is therefore ignored.

Conditions for applying χ²

1) N must be large, say more than 50, to ensure the similarity between the theoretically correct distribution and our sampling distribution.

2) No theoretical cell frequency should be too small, say less than 5, because that may lead to an over-estimation of the value of χ² and may result in the rejection of the hypothesis. In case we get such frequencies, we should pool them with the preceding or succeeding frequencies. (A related adjustment for small frequencies in 2 x 2 tables, in which each difference |O - E| is reduced by 0.5 before squaring, is known as Yates' correction for continuity.)
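A worked sketch of the four steps in Python, using a hypothetical 2 x 2 table (rows: quinine given or not; columns: attacked by fever or not):

    # Hypothetical observed frequencies
    observed = [[20, 80],
                [50, 50]]

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    # Step 1: expected frequency of each cell = row total * column total / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    # Steps 2 and 3: chi-square = sum of (O - E)^2 / E over all cells
    chi_sq = sum((o - e) ** 2 / e
                 for o_row, e_row in zip(observed, expected)
                 for o, e in zip(o_row, e_row))
    print(round(chi_sq, 2))   # 19.78

    # Step 4: for (2-1)*(2-1) = 1 degree of freedom at the 5% level the
    # table value is 3.841; 19.78 exceeds it, so the association is significant.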
USES OF CHI SQUARE TEST:
1) As a test of independence

The Chi Square Test of Independence tests the association between 2 categorical variables.

Whether two or more attributes are associated or not can be tested by framing a hypothesis and testing it against the table value: for example, whether the use of quinine is effective in the control of fever, or whether the complexions of husbands and wives are associated. Consider two variables at the nominal or ordinal levels of measurement. A question of interest is: are the two variables of interest independent (not related) or are they related (dependent)?

When the variables are independent, we are saying that knowledge of one gives us no information about the other variable. When they are dependent, we are saying that knowledge of one variable is helpful in predicting the value of the other variable. One popular method used to check for independence is the chi-squared test of independence. This use of the chi-squared distribution is a nonparametric procedure, whereas the test of significance about a single population variance is a parametric procedure. Assumptions:

  • We take a random sample of size n.
  • The variables of interest are nominal or ordinal in nature.
  • Observations are cross classified according to two criteria such that each observation belongs to one and only one level of each criterion.

2) As a test of goodness of fit

The test for independence discussed above, one of the most frequent uses of chi-square, tests the null hypothesis that two criteria of classification, when applied to a population of subjects, are independent; if they are not independent, then there is an association between them. By contrast, a statistical test in which the validity of one hypothesis is tested without specification of an alternative hypothesis is called a goodness-of-fit test. The general procedure consists in defining a test statistic, which is some function of the data measuring the distance between the hypothesis and the data (in fact, the badness-of-fit), and then calculating the probability of obtaining data which have a still larger value of this test statistic than the value observed, assuming the hypothesis is true. This probability is called the size of the test or confidence level. Small probabilities (say, less than one percent) indicate a poor fit. Especially high probabilities (close to one) correspond to a fit which is too good to happen very often, and may indicate a mistake in the way the test was applied, such as treating data as independent when they are correlated. An attractive feature of the chi-square goodness-of-fit test is that it can be applied to any univariate distribution for which you can calculate the cumulative distribution function. The chi-square goodness-of-fit test is applied to binned data (i.e., data put into classes). This is actually not a restriction, since for non-binned data you can simply calculate a histogram or frequency table before applying the chi-square test. However, the values of the chi-square test statistic depend on how the data is binned. Another disadvantage of the chi-square test is that it requires a sufficient sample size in order for the chi-square approximation to be valid.

3) As a test of homogeneity: this is an extension of the test for independence; it tests whether two or more independent random samples are drawn from the same population or from different populations. The test for homogeneity answers the proposition that several populations are homogeneous with respect to some characteristic.

Statistics Probability Distribution

 

Enumerate probability distributions. Explain the histogram and the probability distribution curve.

Probability distribution curve:
Probability distributions are a fundamental concept in statistics. They are used both on a theoretical level and a practical level.

Some practical uses of probability distributions are:

• To calculate confidence intervals for parameters and to calculate critical regions for hypothesis tests.

• For uni-variate data, it is often useful to determine a reasonable distributional model for the data.

• Statistical intervals and hypothesis tests are often based on specific distributional assumptions. Before computing an interval or test based on a distributional assumption, we need to verify that the assumption is justified for the given data set. In this case, the distribution does not need to be the best-fitting distribution for the data, but an adequate enough model so that the statistical technique yields valid conclusions.

• Simulation studies with random numbers generated from a specific probability distribution are often needed.

The probability distribution of the variable X can be uniquely described by its cumulative distribution function F(x), which is defined by

F(x) = Pr[ X ≤ x ]

for any x in R.

A distribution is called discrete if its cumulative distribution function consists of a sequence of finite jumps, which means that it belongs to a discrete random variable X: a variable which can only attain values from a certain finite or countable set. A distribution is called continuous if its cumulative distribution function is continuous, which means that it belongs to a random variable X for which Pr[ X = x ] = 0 for all x in R.

Several probability distributions are so important in theory or applications that they have been given specific names:

The Bernoulli distribution, which takes value 1 with probability p and value 0 with probability q = 1 - p.
THE POISSON DISTRIBUTION
In probability theory and statistics, the Poisson distribution is a discrete probability distribution, discovered by Siméon-Denis Poisson (1781–1840) and published, together with his probability theory, in 1838. It applies to random variables N that count, among other things, the number of discrete occurrences (sometimes called "arrivals") that take place during a time interval of given length. The probability that there are exactly k occurrences (k being a non-negative integer, k = 0, 1, 2, ...) is

P(N = k) = (λ^k * e^(−λ)) / k!

where e is the base of the natural logarithm (e = 2.71828...),

k! is the factorial of k,

λ is a positive real number, equal to the expected number of occurrences during the given interval. For instance, if the events occur on average every 2 minutes, and you are interested in the number of events occurring in a 10-minute interval, you would use as model a Poisson distribution with λ = 5.
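A minimal Python sketch of the formula above, tabulating P(N = k) for the 10-minute example (λ = 5):

```python
import math

def poisson_pmf(k, lam):
    """P(N = k) = lam**k * e**(-lam) / k!  for k = 0, 1, 2, ..."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# The 10-minute example from the text: events every 2 minutes on
# average, so lam = 5 expected occurrences in the interval.
lam = 5
for k in range(11):
    print(f"P(N = {k:2d}) = {poisson_pmf(k, lam):.4f}")
```

The probabilities over all k sum to 1 (here only approximately, since the tabulation stops at k = 10).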
THE NORMAL DISTRIBUTION
The normal or Gaussian distribution is one of the most important probability density functions, not least because many measurement variables have distributions that at least approximate to a normal distribution. It is usually described as bell shaped, although its exact characteristics are determined by the mean and standard deviation. It arises when the value of a variable is determined by a large number of independent processes. For example, weight is a function of many processes, both genetic and environmental. Many statistical tests make the assumption that the data come from a normal distribution.

THE PROBABILITY DENSITY FUNCTION IS GIVEN BY THE FOLLOWING FORMULA

f(x) = (1 / (σ √(2π))) * e^(−(x − μ)² / (2σ²))

where x = value of the continuous random variable,

μ = mean of the normal random variable (Greek letter 'mu'),

e = exponential constant = 2.7183,

σ = standard deviation of the distribution (Greek letter 'sigma'),

π = mathematical constant = 3.1416.
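A minimal Python sketch of this density (the values of μ and σ below are illustrative assumptions, not taken from the text):

```python
import math

def normal_pdf(x, mu, sigma):
    """f(x) = (1 / (sigma * sqrt(2*pi))) * e**(-(x - mu)**2 / (2 * sigma**2))"""
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Illustrative example: a variable with mean 70 and standard deviation 10.
for x in (50, 60, 70, 80, 90):
    print(f"f({x}) = {normal_pdf(x, mu=70, sigma=10):.5f}")
```

The curve is highest at x = μ and symmetric about it, which gives the familiar bell shape.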

HISTOGRAM AND PROBABILITY DISTRIBUTION CURVE

Histograms are bar charts in which the area of each bar is proportional to the number of observations having values in the range defining the bar. Consider, as an example, the histogram of a population: the population histogram describes the proportion of the population that lies between various limits. It also describes the behaviour of individual observations drawn at random from the population; that is, it gives the probability that an individual selected at random from the population will have a value between specified limits.

When we are talking about populations and probability, we do not use the words "population histogram". Instead, we refer to probability densities and distribution functions (although the term "population histogram" is sometimes kept as a reminder of what a density is). When the area of a histogram is standardized to 1, the histogram becomes a probability density function. The area of any portion of the histogram (the area under any part of the curve) is the proportion of the population in the designated region; it is also the probability that an individual selected at random will have a value in the designated region. For example, if 40% of a population has cholesterol values between 200 and 230 mg/dl, then 40% of the area of the histogram will be between 200 and 230 mg/dl, and the probability that a randomly selected individual will have a cholesterol level in the range 200 to 230 mg/dl is 0.40 or 40%.

Strictly speaking, the histogram is properly a density, which tells you the proportion that lies between specified values. A (cumulative) distribution function is something else: it is a curve whose value is the proportion with values less than or equal to the value on the horizontal axis. Densities have the same name as their distribution functions; for example, a bell-shaped curve is a normal density, and observations that can be described by a normal density are said to follow a normal distribution.
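A minimal sketch (hypothetical cholesterol data, pure Python) of standardizing a histogram so that its total area is 1 and reading off the proportion in a region:

```python
# Hypothetical histogram: class intervals (mg/dl) and observed counts.
bins = [(170, 200), (200, 230), (230, 260), (260, 290)]
counts = [30, 40, 20, 10]

total = sum(counts)
width = 30  # all classes have the same width here

# Density height of each bar: proportion / class width, so that the
# total area of the histogram equals 1.
densities = [c / total / width for c in counts]
print("total area =", sum(d * width for d in densities))  # 1.0

# The proportion (probability) that a randomly selected individual
# falls in the 200-230 mg/dl class is the area of that bar.
p = densities[1] * width
print("P(200 <= X <= 230) =", p)  # 0.40
```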

Statistics Index Numbers

How do you define "index numbers"? Narrate the nature and types of index numbers with adequate examples.

According to Croxton and Cowden, "index numbers are devices for measuring differences in the magnitude of a group of related variables."

According to Morris Hamburg, "in its simplest form an index number is nothing more than a relative which expresses the relationship between two figures, where one figure is used as a base." According to M. L. Berenson and D. M. Levine, "generally speaking, index numbers measure the size or magnitude of some object at a particular point in time as a percentage of some base or reference object in the past." According to Richard I. Levin and David S. Rubin, "an index number is a measure of how much a variable changes over time."

NATURE OF INDEX NUMBER
1) Index numbers are specialized averages used for comparison in situations where two or more series are expressed in different units or represent different items, e.g. the consumer price index representing prices of various items, or the index of industrial production representing various commodities produced.

2) Index numbers measure the net change in a group of related variables over a period of time.

3) Index numbers measure the effect of change over a period of time across a range of industries, geographical regions or countries.

4) The construction of index numbers is carefully planned according to the purpose of their computation: collection of data, application of an appropriate method, assignment of correct weightages and choice of formula.

TYPES OF INDEX NUMBERS:
Price index numbers: A price index is any single number calculated from an array of prices and quantities over a period. Since not all prices and quantities of purchases can be recorded, a representative sample is used instead. Prices are generally represented by p in formulae. These are also expressed as price relatives, defined as follows:

Price relative = (current year's price / base year's price) * 100 = (p1/p0) * 100

Any increase in a price index amounts to a corresponding decrease in the purchasing power of the rupee or other affected currency.

Quantity index numbers: A quantity index number measures how much the number or quantity of a variable changes over time. Quantities are generally represented by q in formulae.

Value index numbers: A value index number measures changes in total monetary worth; that is, it measures changes in the rupee value of a variable. It combines price and quantity changes to present a more informative index.

Composite index numbers: A single index number may reflect a composite, or group, of changing variables. For instance, the consumer price index measures the general price level for specific goods and services in the economy. In such cases the price relatives with respect to a selected base are determined separately for each variable, and their statistical average is computed.
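A minimal Python sketch of the price-relative formula (the prices below are hypothetical):

```python
# Hypothetical base-year (p0) and current-year (p1) prices per item.
prices = {
    "rice":  {"p0": 20.0, "p1": 25.0},
    "wheat": {"p0": 15.0, "p1": 18.0},
    "sugar": {"p0": 30.0, "p1": 33.0},
}

# Price relative = (current year's price / base year's price) * 100.
for item, p in prices.items():
    price_relative = (p["p1"] / p["p0"]) * 100
    print(f"{item}: price relative = {price_relative:.1f}")

# A simple (unweighted) composite index: the arithmetic mean of the
# price relatives.
mean_relative = sum((p["p1"] / p["p0"]) * 100 for p in prices.values()) / len(prices)
print(f"simple composite index = {mean_relative:.1f}")
```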
What is the importance of index numbers in the Indian economy? Explain the index number of industrial production.

IMPORTANCE OF INDEX NUMBERS USED IN INDIAN ECONOMY:

Cost of living index or consumer price index

The cost of living index number or consumer price index, expressed as a percentage, measures the relative amount of money necessary to derive equal satisfaction during two periods of time, after taking into consideration the fluctuations of the retail prices of consumer goods during these periods. This index is relevant in that the real wages of workers are defined as (actual wages / cost of living index) * 100. Generally the list of items consumed varies for different classes of people (rich, middle class, or poor) at the same place of residence; also, people of the same class belonging to different geographical regions have different consumption habits. Thus the cost of living index always relates to a specific class of people and a specific geographical area, and it helps in determining the effect of changes in price on different classes of consumers living in different areas. The process of construction of a cost of living index number is as follows:

1) Decide on the class of people for whom the index number is to be computed, for instance industrial personnel, officers or teachers; also decide on the geographical area to be covered.

2) Conduct a family budget enquiry covering the class of people for whom the index number is to be computed. The enquiry should be conducted for the base year by the process of random sampling. This would give information regarding the nature, quality and quantities of commodities consumed by an average family of the class, and also the amount spent on different items of consumption.

3) The items on which the information regarding money spent is to be collected are food (rice, wheat, sugar, milk, tea etc.), clothing, fuel and lighting, housing, and miscellaneous items.

4) Collect retail prices in respect of the items from the localities in which the class of people concerned reside, or from the markets where they usually make their purchases.

5) As the relative importance of various items is not the same for different classes of people, the prices or price relatives are always weighted; therefore, the cost of living index is always a weighted index.

6) The percentage expenditure on an item constitutes the weight of the item, and the percentage expenditures on the five groups constitute the group weights.

7) Separate index numbers are first determined for each of the five major groups by calculating the weighted average of the price relatives of the selected items in the group. The cost of living index is then obtained as the weighted average of these group index numbers, as sketched below.
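A minimal sketch of this final weighted-average step (the group indices and weights below are hypothetical):

```python
# Hypothetical group indices (weighted averages of price relatives
# within each group) and percentage-expenditure weights.
groups = {
    "food":              {"index": 130.0, "weight": 50},
    "clothing":          {"index": 115.0, "weight": 10},
    "fuel and lighting": {"index": 120.0, "weight": 10},
    "housing":           {"index": 140.0, "weight": 20},
    "miscellaneous":     {"index": 110.0, "weight": 10},
}

# Cost of living index = sum(index * weight) / sum(weight).
num = sum(g["index"] * g["weight"] for g in groups.values())
den = sum(g["weight"] for g in groups.values())
print(f"cost of living index = {num / den:.1f}")
```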

INDEX NUMBER OF INDUSTRIAL PRODUCTION
The index number of industrial production is designed to measure increases or decreases in the level of industrial production in a given period of time compared to some base period. Such an index measures changes in the quantities of production and not their values. Data about the level of industrial output in the base period and in the given period are collected first under the following heads:

  • Textile industries to include cotton, woolen, silk etc.

  • Mining industries like iron ore, iron, coal, copper, petroleum etc.

  • Metallurgical industries like automobiles, locomotives, aeroplanes etc.

  • Industries subject to excise duties like sugar, tobacco, match etc.

  • Miscellaneous like glass, detergents, chemical, cement etc.

The figures of output for the various industries classified above are obtained on a monthly, quarterly or yearly basis. Weights are assigned to the various industries on the basis of some criteria such as capital invested, turnover, net output, production etc. Usually the weights in the index are based on the values of the net output of the different industries. The index of industrial production is obtained by taking the simple arithmetic mean or geometric mean of the quantity relatives. When the simple arithmetic mean is used, the formula for constructing the index is as follows:

Index of industrial production = (100 / Σw) * Σ((q1/q0) * w) = (100 / Σw) * Σ(I * w)

where q1 = quantity produced in the given period,

q0 = quantity produced in the base period,

w = relative importance (weight) assigned to the different outputs,

I = (q1/q0) = index (quantity relative) for the respective commodity.
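A minimal Python sketch of this formula (the quantities and weights below are hypothetical):

```python
# Hypothetical output quantities and weights for three industries.
industries = [
    # q1: given period, q0: base period, w: weight
    {"name": "textiles",      "q1": 120, "q0": 100, "w": 40},
    {"name": "mining",        "q1": 105, "q0": 100, "w": 35},
    {"name": "metallurgical", "q1":  90, "q0": 100, "w": 25},
]

# Index = (100 / sum(w)) * sum((q1/q0) * w), i.e. the weighted
# arithmetic mean of the quantity relatives I = q1/q0.
total_w = sum(ind["w"] for ind in industries)
index = (100 / total_w) * sum(ind["q1"] / ind["q0"] * ind["w"] for ind in industries)
print(f"index of industrial production = {index:.2f}")
```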

Statistics Glossary-I

Attribute :      A qualitative characteristic of an individual which cannot be expressed numerically is called an attribute. 
Alternative Hypothesis :     It is a researcher's hypothesis. 
Bar chart  :      It is a graphic display of how the data falls into different categories or groups.
Bias  :     A systematic over-estimation or under-estimation of a true value.
Bi-Modal :      A frequency curve having two scores of highest frequency of equal value.
Binomial  :      An event having only two possible outcomes, say success and failure.
Bivariate :      Involving two variables.
Box & Whiskers [Box Plot]  :     It is a graphical display of data pointing out the symmetry and the central tendency.
Central Tendency      [Center of location] :      A single value which can be considered as typical or representative of a set of observations and around which the observations can be considered as centered.

 Chi- square  :       It is a non-parametric test used to test the independence of two nominal variables.
Class frequency :     The number of observations that fall into each class.
Class Intervals  :       The groups into which the values of a frequency distribution are divided.
Confidence Interval :  The limits, or the range of values, that the population parameter could possess at a given level of significance.
Continuous Variable :       A variate capable of assuming all the numerical values in a given range.
Correlation Coefficient :       It is the measure of the degree, or extent to which, two variables possess a linear relationship.
Critical Region  :       The region in which a test statistic (such as a Z-score) must lie for the researcher to reject the null hypothesis.
Critical value  :     The computed value of a statistic which is used as a threshold to determine whether the null hypothesis will be rejected.
Compound Probability :     The probability of the joint occurrence of two or more events.
Co-variance :     A measure of the joint variation of two variables; the mean of the products of the deviations of corresponding values of the two variables from their respective means.
Cumulative frequency :     The total frequency of all values less than or equal to the upper class boundary of a given class interval (or, in the "more than" form, greater than or equal to the lower class boundary).

Statistics Glossary-II

 

Data :     Numerical information collected about variables. 
Degree of freedom :     The number of values in a set of data which can be assigned arbitrarily, or at will, without violating the restrictions or independent constraints placed on the data. 
Dependent Event :     An event whose occurrence or non-occurrence is affected by the occurrence or non-occurrence of another event. 
Dependent Variable :     A variable which is caused or influenced by another variable in a given phenomenon. 
Descriptive Statistics :     The numerical data which describes phenomena. 
Directional Test :     A test, used to compare two statistical values and predict that one is higher than the other or vice-versa. 
Discrete Variable :     A variable not capable of assuming all values in a given range or a variable which can be measured only by means of whole numbers. 
Disjoint Occurrence :     Two outcomes which cannot happen simultaneously or which have nothing in common. 
Distribution :     A collection of measurements expressing how scores tend to spread over a measurement scale. 
Dispersion :     The scatter or variability of the data about a given central tendency. 
Frequency Histogram :     A diagram which displays the number of measures falling into different classes.
Frequency polygon :     It is a graphic display in which frequencies are plotted against the mid-points of the class intervals, and the points thus obtained are joined by line segments. 
Fiducial limits :     The confidence limits are also known as Fiducial limits [See Confidence intervals]
Favorable Events  :     The trials, which entail the happening of an event, are favorable to the event.
Grouped Data :   A set of values arranged into different groups or classes.
Histogram :   Graphic display of the frequency of a phenomenon.
Independent Variable :   A variable that causes or influences another variable.
Independent Event :   An event whose occurrence or non-occurrence doesn't affect the occurrence of the other event in any way. 
Inference :   Conclusion about a population parameter based upon the analysis of a sample statistic (sample being drawn from same population). 
Inter-Quartile Range :   The difference between the upper quartile (Q3) and the lower quartile (Q1). 
Interval :   A scale which uses numbers to rank order, with equal and specified intervals between successive values. 
Intercept :   The value of the ordinate (Y) at which a straight line crosses the vertical axis.

Statistics Glossary-III

 

Joint Occurrence :   An occurrence in which two outcomes happen simultaneously (AB or A ∩ B). 
Kurtosis :   It is the degree of flatness or peakedness, in the region of the mode of frequency curve. 
Large sample :   A sample whose size is above 30. 
Least squares :   Any line or curve fitting model, that minimizes the squared distance of the data points to the line.
Leptokurtic :   If the curve is more peaked than the normal curve it is called Leptokurtic.
Lower Quartile (Q1) :   It is the size of the 25th observation when the data is arranged in ascending or descending order or the 25th percentile of a set of measures.
Lines of Regression :   In the scatter plot, if the variables are highly correlated then the dots lie in a narrow strip. If the strip is nearly a straight line then it is called a line of regression.
Level of Significance :   The probability level below which we reject the hypothesis.
Mean :   It is the sum of the measures in a distribution divided by their number.
 Mathematical Expectations :   The sum of the products of values of a variable and their respective probabilities
Measures of Central Tendency :   The descriptive measures which indicate the centered values of a set of observations.
Measure of variation :   It is the descriptive measures which points out the spread of values in a set of values.
Median (Q2) :   The value or the size of the central item of the arranged data or the middle i.e. 50th percentile of the ordered distribution. 
Mode :   It is the size of the item which occurs most frequently in a distribution.
Mutually Exclusive :   Outcomes such that the occurrence of one preclude the occurrence of the other.
Moments :   The arithmetic mean of the various powers of the deviations in any distribution.
Negative Relationship :   In a relationship between two variables when one increases the other decreases or vice - versa.
Nominal :   A scale using numbers, symbols, or titles to designate different sub-classes.
Normal Distribution :   It is the limiting form of the binomial distribution when the number of trials is very large and neither the probability of success nor the probability of failure is very small. 
Non-parametric Test :   A statistical test used when the population cannot be assumed to be normal or when the level of measurement is ordinal or lower. 
Null- Hypothesis :   The opposite or reverse of the researcher's hypothesis.

Statistics Glossary-IV

 

Ogive :   A graphic representation, that displays a running total. 
One-Tail Test (One sided Test) :   A test that predicts that one value is higher than the other. 
Ordinal :   It is a scale which uses numbers or symbols to rank order; the intervals between the ranks are unspecified.
Out-lier :   The points of the data, that fall far away from most of the other points of the data.
Parameter :   A characteristic of a population.
Percentile :   It is a value in an ordered set of measurements that is calculated on the basis of percentage.
Pie- chart :   It is a circular diagram which is a circle (pie) divided by radii (like slices of a cake or pie)
Platykurtic :   If the curve is more flat-topped than the normal curve, it is called a platykurtic curve.
Point Estimate :   A number computed from a sample, representing a population parameter.
Population :   A group of phenomena having something in common.
Positive Correlation :   A relationship between two variables such that when one increases the other also increases or when one decreases the other also decreases.
Power :   When the null hypothesis is false, the probability that a test will reject it is called the power of the test.
Probability :   A quantitative measure of the chances of an outcome or outcomes of a random experiment.
Probability - Distribution :   An unbroken ( smooth ) curve which indicates the frequency distribution of a continuous random variable. 
Random Error :   Error that occurs as a result of sampling variability.
Random Sampling :   The selection of individuals from the population in such a way that each individual of the population has the same chance of being selected, so that the sample is a true representative of the population.
Range :   The difference between the largest and the smallest value of a set.
Region of acceptance :   It is the area under a probability curve in which a computed test statistic will lead to the acceptance of the null hypothesis.
Region of Rejection :   The area under the probability (normal) curve in which a computed test statistic will lead to the rejection of the null hypothesis. 
Regression :   The estimation of the linear dependence of a dependent variable on one or more independent variables.
Relative frequency :   The ratio of a class frequency to the total frequency.
Research Hypothesis :   The expectation or prediction that is to be tested by a researcher.
Residual :   The vertical distance (deviation) between the predicted value of y and its actual value.

Statistics Glossary-V

 

Sample :   A finite set of objects, drawn from the population with the aim that it represents the population. 
Sampling Distribution :   The distribution obtained by computing a statistic for a large number of samples drawn from the same population. 
Scatter Diagram :   A graphic display used to show the degree of correlation between two variables by means of points or dots. 
Skewed :   A distribution with most of its values bunched at one end of the scale and its tail strung out at the other end. 
Standard - Deviation :   It is the square root of the arithmetic mean of the square of deviations of various values from their arithmetic mean. 
Standard error :   The standard deviation of the sampling distribution of a statistic; it measures the random variation of the statistic and is used in converting it into a z-score. 
Statistical Significance :   A result is statistically significant when the probability of obtaining it by chance alone is small (below the chosen level of significance). 
Sample space :   The totality of all outcomes as a result of a random experiment. 
Statistics :   It is a branch of mathematics dealing with aggregates of facts, affected to a marked extent by a multiplicity of causes, numerically expressed, enumerated or estimated according to reasonable standards of accuracy, collected in a systematic manner for a pre-determined purpose and placed in relation to each other. 
Symmetric :   A shape in which one side is the mirror image of the other. 
Systematic Error :   The consistency in under-estimating or over-estimating a true value. 
T- Distribution :   A probability distribution used when the standard deviation of the population is unknown and the sample size is small. 
Test statistic :   A computed statistic value used to decide a hypothesis test. 
Two-Tail-Test :   A test of the prediction that two values are equal, against the alternative that they are not equal. 
Type-I-Error ( α Error ) :   Rejecting a null hypothesis when it is true. 
Type-II-Error ( β Error ) :   Failing to reject the null hypothesis when it is false. 
Trial :   A procedure or an experiment to collect any statistical data. 
Upper Quartile (Q3) :   The 75th percentile of the set of observations. 
Unbiased Estimator :   An estimator whose expected value is equal to the corresponding population parameter. 
Variable :   An observed characteristic of a phenomenon which is to be studied. 
Variance :   The square of the standard deviation; the mean of the squared deviations of the scores about the mean of the distribution.
Z-score :   The standardized normal variate obtained by subtracting the mean and dividing by the standard deviation.

More useful resources:

1. Presentation of Data
2. Measures of Central Tendency
3. Measures of Dispersion
4. Correlation and Regression
5. Probability
6. Index Numbers
7. Miscellaneous Examples
8. FORTRAN
9. Papers