Discussion:
[julia-dev] JIT performance optimization[ maybe ] - unroll any loop at-least once ..
Tsur Herman
2016-09-26 07:23:21 UTC
Permalink
I am probably re-discovering the wheel or about to say something that will
get a response in the form of "its complicated"

It occurred to me that what we need to optimize are loops and iterations.

The following code will run slow:

function F1(A)
total = 0;
for a in A
total += a;
end
total
end

whereas the following code will run fast

function F2(A)
total = zero(eltype(A));
for a in A
total += a;
end
total
end
julia> A=rand(4000,4000);

julia> F1(A);@time F1(A)
0.362453 seconds (48.00 M allocations: 732.422 MB, 5.72% gc time)
7.997862445435431e6

julia> F2(A);@time F2(A)
0.025840 seconds (5 allocations: 176 bytes)
7.997862445435431e6


because in the first version total starts in a different type than the
element of the matrix A . and there is now way for the JITter to tell if
this is deliberate or careless coding .
What I suggest is to *replace any loop in the code with an at-least one
unrolled version and to push the rest into an anonymous function.*
In the parser level.

for example:

function FF(A)
total = 0

loopbody(total,a) = return total += a

total = loopbody(total,A[1])

lambda(A,range,total,loopbody) = begin
@simd for i in range
@inbounds total = loopbody(total,A[i])
end
total
end

return lambda(A,2:prod(size(A)),total,loopbody)
end


And the timing results compared to builtin sum function:

julia> FF(A);@time FF(A)
0.010596 seconds (7 allocations: 208 bytes)
7.997862445434973e6

julia> sum(A);@time sum(A)
0.010666 seconds (5 allocations: 176 bytes)
7.997862445434952e6
Yichao Yu
2016-09-26 12:15:48 UTC
Permalink
Post by Tsur Herman
I am probably re-discovering the wheel or about to say something that will
get a response in the form of "its complicated"
It occurred to me that what we need to optimize are loops and iterations.
function F1(A)
total = 0;
for a in A
total += a;
end
total
end
whereas the following code will run fast
function F2(A)
total = zero(eltype(A));
for a in A
total += a;
end
total
end
julia> A=rand(4000,4000);
0.362453 seconds (48.00 M allocations: 732.422 MB, 5.72% gc time)
7.997862445435431e6
0.025840 seconds (5 allocations: 176 bytes)
7.997862445435431e6
because in the first version total starts in a different type than the
element of the matrix A . and there is now way for the JITter to tell if
this is deliberate or careless coding .
What I suggest is to replace any loop in the code with an at-least one
unrolled version and to push the rest into an anonymous function.
In the parser level.
function FF(A)
total = 0
loopbody(total,a) = return total += a
total = loopbody(total,A[1])
lambda(A,range,total,loopbody) = begin
@simd for i in range
@inbounds total = loopbody(total,A[i])
end
total
end
return lambda(A,2:prod(size(A)),total,loopbody)
end
0.010596 seconds (7 allocations: 208 bytes)
7.997862445434973e6
0.010666 seconds (5 allocations: 176 bytes)
7.997862445434952e6
This is the henchmen unrolling in https://github.com/JuliaLang/julia/issues/3440
Jeff Bezanson
2016-09-27 19:32:20 UTC
Permalink
Just a footnote that the term "henchmen unrolling" makes no sense, and
I believe originates from an iphone autocorrect in an email by Stefan.
We found it funny so we just kept it. Hopefully this term will
eventually appear in a textbook somewhere :)
Post by Yichao Yu
Post by Tsur Herman
I am probably re-discovering the wheel or about to say something that will
get a response in the form of "its complicated"
It occurred to me that what we need to optimize are loops and iterations.
function F1(A)
total = 0;
for a in A
total += a;
end
total
end
whereas the following code will run fast
function F2(A)
total = zero(eltype(A));
for a in A
total += a;
end
total
end
julia> A=rand(4000,4000);
0.362453 seconds (48.00 M allocations: 732.422 MB, 5.72% gc time)
7.997862445435431e6
0.025840 seconds (5 allocations: 176 bytes)
7.997862445435431e6
because in the first version total starts in a different type than the
element of the matrix A . and there is now way for the JITter to tell if
this is deliberate or careless coding .
What I suggest is to replace any loop in the code with an at-least one
unrolled version and to push the rest into an anonymous function.
In the parser level.
function FF(A)
total = 0
loopbody(total,a) = return total += a
total = loopbody(total,A[1])
lambda(A,range,total,loopbody) = begin
@simd for i in range
@inbounds total = loopbody(total,A[i])
end
total
end
return lambda(A,2:prod(size(A)),total,loopbody)
end
0.010596 seconds (7 allocations: 208 bytes)
7.997862445434973e6
0.010666 seconds (5 allocations: 176 bytes)
7.997862445434952e6
This is the henchmen unrolling in https://github.com/JuliaLang/julia/issues/3440
Stefan Karpinski
2016-09-27 21:13:23 UTC
Permalink
I believe the actual term for this kind of optimization is "loop peeling",
which makes much more sense. I'm probably still gonna call it henchmen
unrolling though. An issue with henchmen unrolling is that while it may
improve type inference for the loop body, it often doesn't improve the
inferred type for the entire function. The classic example is the
type-unstable summation loop:

function sumit(v::Vector{Float64})
s = 0
for x in v
s += x
end
return s
end


Henchmen unrolling this makes the loop body type-stable – s is Float64 –
but the function can still return an Int in the case where v is empty, so
the function remains type unstable. I guess that a return type annotation
plus henchmen unrolling does fix the problem, but it seems better to use
static analysis to uncover this to the programmer so they know to write s =
0.0 instead.
Post by Jeff Bezanson
Just a footnote that the term "henchmen unrolling" makes no sense, and
I believe originates from an iphone autocorrect in an email by Stefan.
We found it funny so we just kept it. Hopefully this term will
eventually appear in a textbook somewhere :)
Post by Yichao Yu
Post by Tsur Herman
I am probably re-discovering the wheel or about to say something that
will
Post by Yichao Yu
Post by Tsur Herman
get a response in the form of "its complicated"
It occurred to me that what we need to optimize are loops and
iterations.
Post by Yichao Yu
Post by Tsur Herman
function F1(A)
total = 0;
for a in A
total += a;
end
total
end
whereas the following code will run fast
function F2(A)
total = zero(eltype(A));
for a in A
total += a;
end
total
end
julia> A=rand(4000,4000);
0.362453 seconds (48.00 M allocations: 732.422 MB, 5.72% gc time)
7.997862445435431e6
0.025840 seconds (5 allocations: 176 bytes)
7.997862445435431e6
because in the first version total starts in a different type than the
element of the matrix A . and there is now way for the JITter to tell if
this is deliberate or careless coding .
What I suggest is to replace any loop in the code with an at-least one
unrolled version and to push the rest into an anonymous function.
In the parser level.
function FF(A)
total = 0
loopbody(total,a) = return total += a
total = loopbody(total,A[1])
lambda(A,range,total,loopbody) = begin
@simd for i in range
@inbounds total = loopbody(total,A[i])
end
total
end
return lambda(A,2:prod(size(A)),total,loopbody)
end
0.010596 seconds (7 allocations: 208 bytes)
7.997862445434973e6
0.010666 seconds (5 allocations: 176 bytes)
7.997862445434952e6
This is the henchmen unrolling in https://github.com/JuliaLang/j
ulia/issues/3440
Milan Bouchet-Valat
2016-09-27 21:29:29 UTC
Permalink
Post by Stefan Karpinski
I believe the actual term for this kind of optimization is "loop
peeling", which makes much more sense. I'm probably still gonna call
it henchmen unrolling though. An issue with henchmen unrolling is
that while it may improve type inference for the loop body, it often
doesn't improve the inferred type for the entire function. The
function sumit(v::Vector{Float64})
    s = 0
    for x in v
        s += x
    end
    return s
end
 
Henchmen unrolling this makes the loop body type-stable – s is
Float64 – but the function can still return an Int in the case where
v is empty, so the function remains type unstable. I guess that a
return type annotation plus henchmen unrolling does fix the problem,
but it seems better to use static analysis to uncover this to the
programmer so they know to write s = 0.0 instead.
It would be so nice to be able to tell the compiler to choose a 0 of
whatever type it needs to make the function type stable... These zero()
initialization steps can be so tedious to write fully correctly when
the computation inside the loop is complex. This is true in particular
when it involves a multiplication and an addition, for which custom
types might have arbitrary promotion rules. A poster child for this is
computing the deviance of a linear model:
https://github.com/JuliaStats/GLM.jl/blob/4d9399b559345b3926c74631019d645073b53bb0/src/lm.jl#L31

Any chance we could find a solution to this? Julia is generally smart
enough to infer the best type everywhere it's logically possible, and
this is one of the very few places were it doesn't.


Regards
Post by Stefan Karpinski
On Tue, Sep 27, 2016 at 3:32 PM, Jeff Bezanson
Post by Jeff Bezanson
Just a footnote that the term "henchmen unrolling" makes no sense, and
I believe originates from an iphone autocorrect in an email by Stefan.
We found it funny so we just kept it. Hopefully this term will
eventually appear in a textbook somewhere :)
Post by Yichao Yu
Post by Tsur Herman
I am probably re-discovering the wheel or about to say something
that will
Post by Yichao Yu
Post by Tsur Herman
get a response in the form of "its complicated"
It occurred to me that what we  need to optimize are loops and
iterations.
Post by Yichao Yu
Post by Tsur Herman
function F1(A)
     total = 0;
     for a in A
         total += a;
     end
     total
end
whereas the following code will run fast
function F2(A)
     total = zero(eltype(A));
     for a in A
         total += a;
     end
     total
end
julia> A=rand(4000,4000);
   0.362453 seconds (48.00 M allocations: 732.422 MB, 5.72% gc
time)
Post by Yichao Yu
Post by Tsur Herman
7.997862445435431e6
   0.025840 seconds (5 allocations: 176 bytes)
7.997862445435431e6
because in the first version total starts in a different type
than the
Post by Yichao Yu
Post by Tsur Herman
element of the matrix A . and there is now way for the JITter to
tell if
Post by Yichao Yu
Post by Tsur Herman
this is deliberate or careless coding .
What I suggest is to replace any loop in the code with an at-
least one
Post by Yichao Yu
Post by Tsur Herman
unrolled version and to push the rest into an anonymous
function.
Post by Yichao Yu
Post by Tsur Herman
In the parser level.
function FF(A)
     total = 0
     loopbody(total,a) = return total += a
     total = loopbody(total,A[1])
     lambda(A,range,total,loopbody) = begin
         end
         total
     end
     return lambda(A,2:prod(size(A)),total,loopbody)
end
   0.010596 seconds (7 allocations: 208 bytes)
7.997862445434973e6
   0.010666 seconds (5 allocations: 176 bytes)
7.997862445434952e6
This is the henchmen unrolling in https://github.com/JuliaLang/ju
lia/issues/3440
Tsur Herman
2016-09-28 09:37:56 UTC
Permalink
Stefan wrote:
Henchmen unrolling this makes the loop body type-stable – s is Float64 –
but the function can still return an Int in the case where v is empty,

Maybe a solution to this is to define operations involving empty containers
as a conversion operator. and any iteration over an item from container
will be preceded with an iteration using empty containers.

so
a=[] #type float64
s=0 #type int

s += a => convert(typeof(a),s)

And another thing in that matter :
I think eltype of function should return the inferred type for that
function (applied recursively)
so lets say we have:
function f(x)
y = x[1] + g(x)
end

then
eltype(f,(float64))

would be :

eltype(+,(eltype(x),eltype(g,float64))

I think Base.code_typed already does that although I don't think it is
recursive.

The whole idea came from observing that:
eltype(t^2 for t in rand(10))
Any


eltype([t^2 for t in rand(10)])
Float64

But if I declare
eltype(G::Base.Generator) =
Base.code_typed(G.f,(eltype(G.iter),))[1].rettype

then:
eltype(t^2 for t in rand(10))
Float64

Loading...