Posted by: Gary Ernest Davis on: September 13, 2013
It is a Data Scientist,
And he stoppeth one of three.
`By thy Python code and glittering eye,
Now wherefore stopp’st thou me?
The classroom doors are opened wide,
And I am next one in;
The others are met, the test is set:
Mayst hear the noisy din.’
He holds him with his skinny hand,
“There was a cluster,” quoth he.
`Hold off! unhand me, open-source loon!’
Eftsoons his hand dropped he.
He holds him with his glittering eye –
The student stood quite still,
And listens like a three years’ child:
The Scientist hath his will.
The student sat upon a stone:
He cannot sort his list;
And thus spake on the young person,
The Data Scientist.
“The code was cleared, the whole team cheered,
Merrily did we drop
Unto the pub, and there to drink,
Without a thought to stop.
The variables were writ upon the left,
Transferred from R to C;
The code shone bright, and on the right
The data a, b, c.
More and more code every day,
It was a wondrous thing –
The student here did beat his breast,
For he heard the exam bell ring.
The examiner hath paced into the hall,
Red of face is he;
Nodding his head from side to side –
A fan of Scotch whisky.
The student he did beat his breast,
He forgets to sort his list;
And thus spake on the young man,
The Data Scientist.
“And now the data surge came, and it
Was tyrannous and strong:
It struck with massive overload,
And analysis took so long.
With high performance really stretched,
As who pursued with yell and blow
Still treads the shadow of his foe,
And forward bends his head,
The cluster was fast, it was a blast,
And onward aye we sped.
And now there were missing values and outliers,
And it grew wondrous confused:
And deleted columns, as if floating by,
Their data could not be used.
And through the drifts the snowy clifts
Did send a dismal sheen:
Nor shapes of men nor beasts we ken –
The grep was all between.
The grep was here, the grep was there,
The grep was all around:
It cracked and growled, and roared and howled,
Regular expressions could not be found!
At length did cross a statistician,
Thorough the fog she came;
As she had been a blessed soul,
We hailed her in Tukey’s name.
She saw the data we ne’er had seen,
And all around she went.
The data did split with a thunder-fit;
The programmers steered us through!
And a good data stream sprung up behind;
The statistician did follow,
And every day, for data or play,
Came to the programmer’s hollo!
In hard-drive or cloud, whatever’s allowed,
She analyzed the data mine;
Whiles all the night, with code writ right,
The programmers drank moonshine.”
`God save thee, Data Scientist,
From the fiends that plague thee thus! –
Why look’st thou perchance?’ – “With a wicked glance,
I fired the statistician.”
Posted by: Gary Ernest Davis on: June 25, 2013
Surely the leading (= left-most) digit of a positive integer is an obvious thing? Just stare at the integer (e.g. 7823) and observe the left-most digit (7).
Suppose, however, that you wanted to find the leading digit of every integer in a very large list of positive integers, a list so large it was hard or impossible to peruse by eye. How could you write an algorithm to compute the leading digits? Even more, suppose you wanted to come up with a mathematical argument that involved determining the leading digit of an otherwise unspecified positive integer?
In a short and lovely mathematical argument, Dave Radcliffe (@daveinstpaul) proves that there are exactly 18266 distinct ordered lists of values
(leading digit of $2^n$, … , leading digit of $9^n$)
as n ranges over the infinite set of positive integers.
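To get a feel for these ordered lists, here is a minimal Python sketch that computes them by brute force with exact integer arithmetic. The function names are illustrative and not taken from Radcliffe's argument. Enumerating a finite range of n can only show which lists have appeared so far, so it does not by itself establish the count.

```python
def leading_digit(k: int) -> int:
    """Leading (left-most) decimal digit of a positive integer k."""
    return int(str(k)[0])


def leading_digit_tuple(n: int) -> tuple:
    """Leading digits of 2**n, 3**n, ..., 9**n, computed exactly."""
    return tuple(leading_digit(a ** n) for a in range(2, 10))


def distinct_tuples(n_max: int) -> set:
    """Distinct leading-digit tuples seen for n = 1, ..., n_max."""
    return {leading_digit_tuple(n) for n in range(1, n_max + 1)}


if __name__ == "__main__":
    # n_max is kept modest: recent Python versions limit int-to-str
    # conversion to 4300 digits by default, and 9**n grows quickly.
    print(leading_digit_tuple(1))
    print(len(distinct_tuples(1000)))
```

Because each power is converted to a string, this is only practical for moderate n; the logarithm argument that follows avoids computing the powers at all.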
A key part of his argument is that the leading digit of $a^n$ is completely determined by the fractional part of $n \times \log_{10}(a)$.
How might we see this?
Let’s make a table of values and see if something pops out:
k | fractional part of $\log_{10}(k)$ |
1 | 0. |
2 | 0.30103 |
3 | 0.477121 |
4 | 0.60206 |
5 | 0.69897 |
6 | 0.778151 |
7 | 0.845098 |
8 | 0.90309 |
9 | 0.954243 |
10 | 0. |
11 | 0.0413927 |
12 | 0.0791812 |
13 | 0.113943 |
14 | 0.146128 |
15 | 0.176091 |
16 | 0.20412 |
17 | 0.230449 |
18 | 0.255273 |
19 | 0.278754 |
20 | 0.30103 |
Nothing obvious, so what does Dave Radcliffe mean by “the leading digit of $a^n$ is completely determined by the fractional part of $n \times \log_{10}(a)$”?
Let’s think about how we can algorithmically determine the leading digit of an integer written base 10.
Suppose k is a positive integer that is of the form $a_p a_{p-1} \ldots a_1 a_0$ in base 10.
That is, the $a_i$ are digits base 10 (i.e. one of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9) and $a_p$ is not 0, because it is the leading digit.
What every school child does not immediately recall is that this means
$k = a_p 10^p + a_{p-1} 10^{p-1} + \cdots + a_1 10 + a_0$
So $10^p$ is no bigger than k, which in turn is less than $10^{p+1}$:
$10^p \le k < 10^{p+1}$
Therefore,
$\log_{10}(10^p) \le \log_{10}(k) < \log_{10}(10^{p+1})$
because the logarithm is an increasing function.
In other words,
$p \le \log_{10}(k) < p+1$
which means that p is the greatest integer less than or equal to $\log_{10}(k)$, that is, the floor of $\log_{10}(k)$: $p = \lfloor \log_{10}(k) \rfloor$.
Now if we divide k by $10^p$ we get:
$k/10^p = a_p + 0.a_{p-1} \ldots a_1 a_0$ (base 10)
which means $a_p = \lfloor k/10^p \rfloor = \lfloor k/10^{\lfloor \log_{10}(k) \rfloor} \rfloor$
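The steps so far already give an algorithm: find p with 10^p ≤ k < 10^(p+1), then take the floor of k/10^p. The sketch below does this with integer arithmetic only, so it stays exact for arbitrarily large k; the function name is illustrative.

```python
def leading_digit_int(k: int) -> int:
    """Leading decimal digit of a positive integer, using integers only."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # Find the largest power of ten not exceeding k.
    # Invariant: power == 10**p and power <= k.
    power = 1
    while power * 10 <= k:
        power *= 10
    # a_p = floor(k / 10**p); integer division is exact for Python ints.
    return k // power


if __name__ == "__main__":
    print(leading_digit_int(7823))
```

The loop runs once per decimal digit of k, which is fine for a list of ordinary integers and avoids any floating-point rounding.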
We can express $\lfloor \log_{10}(k) \rfloor$ in terms of the fractional part $\{\log_{10}(k)\}$ of $\log_{10}(k)$ simply as
$\lfloor \log_{10}(k) \rfloor = \log_{10}(k) - \{\log_{10}(k)\}$
So,
$10^{\lfloor \log_{10}(k) \rfloor} = 10^{\log_{10}(k) - \{\log_{10}(k)\}} = k/10^{\{\log_{10}(k)\}}$
which, upon substituting into the expression above for ap, gives:
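$a_p = \lfloor 10^{\{\log_{10}(k)\}} \rfloor$
This is the precise sense in which the leading digit, $a_p$, of k is determined by the fractional part $\{\log_{10}(k)\}$ of $\log_{10}(k)$.
When $k = a^n$, we have $\log_{10}(k) = n \times \log_{10}(a)$, so this is just the fractional part of $n \times \log_{10}(a)$.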
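The closed form can be tried directly in floating point. The following is a rough sketch rather than a robust implementation: the small tolerance `eps` is an assumption added here, because when k is a single digit times a power of ten, 10 raised to the fractional part sits exactly on an integer and rounding could otherwise land just below it. It should only be trusted for k of modest size (well below 10^8 with this tolerance).

```python
import math


def leading_digit_via_log(k: int, eps: float = 1e-9) -> int:
    """Leading digit of k as floor(10 ** frac(log10(k))), in floating point."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    log_k = math.log10(k)
    frac = log_k - math.floor(log_k)  # fractional part {log10(k)}
    # eps (an assumption of this sketch) guards against results such as
    # 2.9999999999999996 when the exact value is 3.
    digit = math.floor(10 ** frac + eps)
    # Rounding can push an exact power of ten up to 10; its leading digit is 1.
    return 1 if digit == 10 else digit


if __name__ == "__main__":
    for k in (7823, 10, 20):
        print(k, leading_digit_via_log(k))
```

For exact work on large integers, an integer-only method is the safer choice; the value of the logarithmic form is that it works with the fractional part alone, which is what the argument needs.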
Going back to the table above, and including a third column of $\lfloor 10^{\{\log_{10}(k)\}} \rfloor$, we get:
k | fractional part of $\log_{10}(k)$ | $\lfloor 10^{\{\log_{10}(k)\}} \rfloor$ |
1 | 0. | 1 |
2 | 0.30103 | 2 |
3 | 0.477121 | 3 |
4 | 0.60206 | 4 |
5 | 0.69897 | 5 |
6 | 0.778151 | 6 |
7 | 0.845098 | 7 |
8 | 0.90309 | 8 |
9 | 0.954243 | 9 |
10 | 0. | 1 |
11 | 0.0413927 | 1 |
12 | 0.0791812 | 1 |
13 | 0.113943 | 1 |
14 | 0.146128 | 1 |
15 | 0.176091 | 1 |
16 | 0.20412 | 1 |
17 | 0.230449 | 1 |
18 | 0.255273 | 1 |
19 | 0.278754 | 1 |
20 | 0.30103 | 2 |
Or, if we should do this for $k = 2^n$ for varying n, we get a table that begins as follows:
n | $2^n$ | fractional part of $n \times \log_{10}(2)$ | $\lfloor 10^{\{n \times \log_{10}(2)\}} \rfloor$ |
1 | 2 | 0.30103 | 2 |
2 | 4 | 0.60206 | 4 |
3 | 8 | 0.90309 | 8 |
4 | 16 | 0.20412 | 1 |
5 | 32 | 0.50515 | 3 |
6 | 64 | 0.80618 | 6 |
7 | 128 | 0.10721 | 1 |
8 | 256 | 0.40824 | 2 |
9 | 512 | 0.70927 | 5 |
10 | 1024 | 0.0103 | 1 |
11 | 2048 | 0.31133 | 2 |
12 | 4096 | 0.61236 | 4 |
13 | 8192 | 0.91339 | 8 |
14 | 16384 | 0.21442 | 1 |
15 | 32768 | 0.51545 | 3 |
16 | 65536 | 0.81648 | 6 |
17 | 131072 | 0.11751 | 1 |
18 | 262144 | 0.41854 | 2 |
19 | 524288 | 0.71957 | 5 |
20 | 1048576 | 0.0205999 | 1 |
in precise agreement with Dave Radcliffe’s assertion.
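A table like the one above can be regenerated, and compared against the exact powers, with a short Python sketch. The function name and the tolerance `eps` are assumptions of this sketch. Floating-point error in n × log10(a) grows with n, so the comparison is only meaningful for moderate n.

```python
import math


def power_leading_digits(a: int, n_max: int, eps: float = 1e-9) -> list:
    """Rows (n, frac(n * log10(a)), leading digit of a**n) for n = 1..n_max.

    The digit comes from the fractional part alone; a**n is never formed.
    """
    log_a = math.log10(a)
    rows = []
    for n in range(1, n_max + 1):
        frac = (n * log_a) % 1.0  # fractional part of n * log10(a)
        digit = math.floor(10 ** frac + eps)  # eps absorbs rounding error
        if digit == 10:  # rounding artefact at an exact power of ten
            digit = 1
        rows.append((n, frac, digit))
    return rows


if __name__ == "__main__":
    for n, frac, digit in power_leading_digits(2, 20):
        exact = int(str(2 ** n)[0])  # exact leading digit, for comparison
        print(n, 2 ** n, round(frac, 6), digit, digit == exact)
```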