Age | Commit message (Collapse) | Author |
|
camellia cipher
Patch adds AVX2/AES-NI/x86-64 implementation of Camellia cipher, requiring
32 parallel blocks for input (512 bytes). Compared to AVX implementation, this
version is extended to use the 256-bit wide YMM registers. For AES-NI
instructions data is split to two 128-bit registers and merged afterwards.
Even with this additional handling, performance should be higher compared
to the AES-NI/AVX implementation.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Patch adds AVX2/x86-64 implementation of Serpent cipher, requiring 16 parallel
blocks for input (256 bytes). Implementation is based on the AVX implementation
and extends to use the 256-bit wide YMM registers. Since serpent does not use
table look-ups, this implementation should be close to two times faster than
the AVX implementation.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Patch adds AVX2/x86-64 implementation of Twofish cipher, requiring 16 parallel
blocks for input (256 bytes). Table look-ups are performed using vpgatherdd
instruction directly from vector registers and thus should be faster than
earlier implementations. Implementation also uses 256-bit wide YMM registers,
which should give additional speed up compared to the AVX implementation.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Patch adds AVX2/x86-64 implementation of Blowfish cipher, requiring 32 parallel
blocks for input (256 bytes). Table look-ups are performed using vpgatherdd
instruction directly from vector registers and thus should be faster than
earlier implementations.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Add more optimized XTS code for aesni_intel in 64-bit mode, for smaller stack
usage and boost for speed.
tcrypt results, with Intel i5-2450M:
256-bit key
enc dec
16B 0.98x 0.99x
64B 0.64x 0.63x
256B 1.29x 1.32x
1024B 1.54x 1.58x
8192B 1.57x 1.60x
512-bit key
enc dec
16B 0.98x 0.99x
64B 0.60x 0.59x
256B 1.24x 1.25x
1024B 1.39x 1.42x
8192B 1.38x 1.42x
I chose not to optimize smaller than block size of 256 bytes, since XTS is
practically always used with data blocks of size 512 bytes. This is why
performance is reduced in tcrypt for 64 byte long blocks.
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Add more optimized XTS code for camellia-aesni-avx, for smaller stack usage
and small boost for speed.
tcrypt results, with Intel i5-2450M:
enc dec
16B 1.10x 1.01x
64B 0.82x 0.77x
256B 1.14x 1.10x
1024B 1.17x 1.16x
8192B 1.10x 1.11x
Since XTS is practically always used with data blocks of size 512 bytes or
more, I chose to not make use of camellia-2way for block sized smaller than
256 bytes. This causes slower result in tcrypt for 64 bytes.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Change cast6-avx to use the new XTS code, for smaller stack usage and small
boost to performance.
tcrypt results, with Intel i5-2450M:
enc dec
16B 1.01x 1.01x
64B 1.01x 1.00x
256B 1.09x 1.02x
1024B 1.08x 1.06x
8192B 1.08x 1.07x
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Change twofish-avx to use the new XTS code, for smaller stack usage and small
boost to performance.
tcrypt results, with Intel i5-2450M:
enc dec
16B 1.03x 1.02x
64B 0.91x 0.91x
256B 1.10x 1.09x
1024B 1.12x 1.11x
8192B 1.12x 1.11x
Since XTS is practically always used with data blocks of size 512 bytes or
more, I chose to not make use of twofish-3way for block sized smaller than
128 bytes. This causes slower result in tcrypt for 64 bytes.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
This patch adds AVX optimized XTS-mode helper functions/macros and converts
serpent-avx to use the new facilities. Benefits are slightly improved speed
and reduced stack usage as use of temporary IV-array is avoided.
tcrypt results, with Intel i5-2450M:
enc dec
16B 1.00x 1.00x
64B 1.00x 1.00x
256B 1.04x 1.06x
1024B 1.09x 1.09x
8192B 1.10x 1.09x
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Occurs when CONFIG_CRYPTO_CRC32C_INTEL=y and CONFIG_CRYPTO_CRC32C_INTEL=y.
Older versions of bintuils do not support the pclmulqdq instruction. The
PCLMULQDQ gas macro is used instead.
Signed-off-by: Sandy Wu <sandyw@twitter.com>
Cc: stable@vger.kernel.org # 3.8+
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
SSSE3, AVX or AVX2 instructions.
We added glue code and config options to create crypto
module that uses SSE/AVX/AVX2 optimized SHA512 x86_64 assembly routines.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
instruction.
Provides SHA512 x86_64 assembly routine optimized with SSE, AVX and
AVX2's RORX instructions. Speedup of 70% or more has been
measured over the generic implementation.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
instructions.
Provides SHA512 x86_64 assembly routine optimized with SSE and AVX instructions.
Speedup of 60% or more has been measured over the generic implementation.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
SSE3 instructions.
Provides SHA512 x86_64 assembly routine optimized with SSSE3 instructions.
Speedup of 40% or more has been measured over the generic implementation.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
SSSE3, AVX or AVX2 instructions.
We added glue code and config options to create crypto
module that uses SSE/AVX/AVX2 optimized SHA256 x86_64 assembly routines.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Provides SHA256 x86_64 assembly routine optimized with SSE, AVX and
AVX2's RORX instructions. Speedup of 70% or more has been
measured over the generic implementation.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Provides SHA256 x86_64 assembly routine optimized with SSE and AVX instructions.
Speedup of 60% or more has been measured over the generic implementation.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
SSE3 instructions.
Provides SHA256 x86_64 assembly routine optimized with SSSE3 instructions.
Speedup of 40% or more has been measured over the generic implementation.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
supports AVX instructions
These modules require AVX support in assembler, so add new check to Makefile
for this.
Other option would be to use CONFIG_AS_AVX inside source files, but that would
result dummy/empty/no-fuctionality modules being created.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
with PCLMULQDQ instructions.
Herbert,
The following patch update the stale link to the CRC32C white paper
that was referenced.
Tim
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
This bool option can never be set to anything other than y. So
let's just kill it.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Pull crypto update from Herbert Xu:
"Here is the crypto update for 3.9:
- Added accelerated implementation of crc32 using pclmulqdq.
- Added test vector for fcrypt.
- Added support for OMAP4/AM33XX cipher and hash.
- Fixed loose crypto_user input checks.
- Misc fixes"
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (43 commits)
crypto: user - ensure user supplied strings are nul-terminated
crypto: user - fix empty string test in report API
crypto: user - fix info leaks in report API
crypto: caam - Added property fsl,sec-era in SEC4.0 device tree binding.
crypto: use ERR_CAST
crypto: atmel-aes - adjust duplicate test
crypto: crc32-pclmul - Kill warning on x86-32
crypto: x86/twofish - assembler clean-ups: use ENTRY/ENDPROC, localize jump labels
crypto: x86/sha1 - assembler clean-ups: use ENTRY/ENDPROC
crypto: x86/serpent - use ENTRY/ENDPROC for assember functions and localize jump targets
crypto: x86/salsa20 - assembler cleanup, use ENTRY/ENDPROC for assember functions and rename ECRYPT_* to salsa20_*
crypto: x86/ghash - assembler clean-up: use ENDPROC at end of assember functions
crypto: x86/crc32c - assembler clean-up: use ENTRY/ENDPROC
crypto: cast6-avx: use ENTRY()/ENDPROC() for assembler functions
crypto: cast5-avx: use ENTRY()/ENDPROC() for assembler functions and localize jump targets
crypto: camellia-x86_64/aes-ni: use ENTRY()/ENDPROC() for assembler functions and localize jump targets
crypto: blowfish-x86_64: use ENTRY()/ENDPROC() for assembler functions and localize jump targets
crypto: aesni-intel - add ENDPROC statements for assembler functions
crypto: x86/aes - assembler clean-ups: use ENTRY/ENDPROC, localize jump targets
crypto: testmgr - add test vector for fcrypt
...
|
|
This patch removes a gratuitous warning on x86-32:
arch/x86/crypto/crc32-pclmul_asm.S:87:2: warning: #warning Using 32bit code support [-Wcpp]
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
labels
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
jump targets
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
functions and rename ECRYPT_* to salsa20_*
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Signed-off-by: Jussi Kivilinna <jussi.kivilinn@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
localize jump targets
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
functions and localize jump targets
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
localize jump targets
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
implementation
This patch adds crc32 algorithms to shash crypto api. One is wrapper to
gerneric crc32_le function. Second is crc32 pclmulqdq implementation. It
use hardware provided PCLMULQDQ instruction to accelerate the CRC32 disposal.
This instruction present from Intel Westmere and AMD Bulldozer CPUs.
For intel core i5 I got 450MB/s for table implementation and 2100MB/s
for pclmulqdq implementation.
Signed-off-by: Alexander Boyko <alexander_boyko@xyratex.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
ctr-module instead
rfc3686 in CTR module is now able of using asynchronous ctr(aes) from
aesni-intel, so rfc3686(ctr(aes)) in aesni-intel is no longer needed.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
Pull crypto update from Herbert Xu:
- Added aesni/avx/x86_64 implementations for camellia.
- Optimised AVX code for cast5/serpent/twofish/cast6.
- Fixed vmac bug with unaligned input.
- Allow compression algorithms in FIPS mode.
- Optimised crc32c implementation for Intel.
- Misc fixes.
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (32 commits)
crypto: caam - Updated SEC-4.0 device tree binding for ERA information.
crypto: testmgr - remove superfluous initializers for xts(aes)
crypto: testmgr - allow compression algs in fips mode
crypto: testmgr - add larger crc32c test vector to test FPU path in crc32c_intel
crypto: testmgr - clean alg_test_null entries in alg_test_descs[]
crypto: testmgr - remove fips_allowed flag from camellia-aesni null-tests
crypto: cast5/cast6 - move lookup tables to shared module
padata: use __this_cpu_read per-cpu helper
crypto: s5p-sss - Fix compilation error
crypto: picoxcell - Add terminating entry for platform_device_id table
crypto: omap-aes - select BLKCIPHER2
crypto: camellia - add AES-NI/AVX/x86_64 assembler implementation of camellia cipher
crypto: camellia-x86_64 - share common functions and move structures and function definitions to header file
crypto: tcrypt - add async speed test for camellia cipher
crypto: tegra-aes - fix error-valued pointer dereference
crypto: tegra - fix missing unlock on error case
crypto: cast5/avx - avoid using temporary stack buffers
crypto: serpent/avx - avoid using temporary stack buffers
crypto: twofish/avx - avoid using temporary stack buffers
crypto: cast6/avx - avoid using temporary stack buffers
...
|
|
CAST5 and CAST6 both use same lookup tables, which can be moved shared module
'cast_common'.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
camellia cipher
This patch adds AES-NI/AVX/x86_64 assembler implementation of Camellia block
cipher. Implementation process data in sixteen block chunks, which are
byte-sliced and AES SubBytes is reused for Camellia s-box with help of pre-
and post-filtering.
Patch has been tested with tcrypt and automated filesystem tests.
tcrypt test results:
Intel Core i5-2450M:
camellia-aesni-avx vs camellia-asm-x86_64-2way:
128bit key: (lrw:256bit) (xts:256bit)
size ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
16B 0.98x 0.96x 0.99x 0.96x 0.96x 0.95x 0.95x 0.94x 0.97x 0.98x
64B 0.99x 0.98x 1.00x 0.98x 0.98x 0.99x 0.98x 0.93x 0.99x 0.98x
256B 2.28x 2.28x 1.01x 2.29x 2.25x 2.24x 1.96x 1.97x 1.91x 1.90x
1024B 2.57x 2.56x 1.00x 2.57x 2.51x 2.53x 2.19x 2.17x 2.19x 2.22x
8192B 2.49x 2.49x 1.00x 2.53x 2.48x 2.49x 2.17x 2.17x 2.22x 2.22x
256bit key: (lrw:384bit) (xts:512bit)
size ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
16B 0.97x 0.98x 0.99x 0.97x 0.97x 0.96x 0.97x 0.98x 0.98x 0.99x
64B 1.00x 1.00x 1.01x 0.99x 0.98x 0.99x 0.99x 0.99x 0.99x 0.99x
256B 2.37x 2.37x 1.01x 2.39x 2.35x 2.33x 2.10x 2.11x 1.99x 2.02x
1024B 2.58x 2.60x 1.00x 2.58x 2.56x 2.56x 2.28x 2.29x 2.28x 2.29x
8192B 2.50x 2.52x 1.00x 2.56x 2.51x 2.51x 2.24x 2.25x 2.26x 2.29x
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
function definitions to header file
Prepare camellia-x86_64 functions to be reused from AVX/AESNI implementation
module.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Introduce new assembler functions to avoid use temporary stack buffers in glue
code. This also allows use of vector instructions for xoring output in CTR and
CBC modes and construction of IVs for CTR mode.
ECB mode sees ~0.5% decrease in speed because added one extra function
call. CBC mode decryption and CTR mode benefit from vector operations
and gain ~5%.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Introduce new assembler functions to avoid use temporary stack buffers in glue
code. This also allows use of vector instructions for xoring output in CTR and
CBC modes and construction of IVs for CTR mode.
ECB mode sees ~0.5% decrease in speed because added one extra function
call. CBC mode decryption and CTR mode benefit from vector operations
and gain ~3%.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Introduce new assembler functions to avoid use temporary stack buffers in glue
code. This also allows use of vector instructions for xoring output in CTR and
CBC modes and construction of IVs for CTR mode.
ECB mode sees ~0.2% decrease in speed because added one extra function
call. CBC mode decryption and CTR mode benefit from vector operations
and gain ~3%.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Introduce new assembler functions to avoid use temporary stack buffers in
glue code. This also allows use of vector instructions for xoring output
in CTR and CBC modes and construction of IVs for CTR mode.
ECB mode sees ~0.5% decrease in speed because added one extra function
call. CBC mode decryption and CTR mode benefit from vector operations
and gain ~2%.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
'u128' currently used for CTR mode is on little-endian 'long long' swapped
and would require extra swap operations by SSE/AVX code. Use of le128
instead of u128 allows IV calculations to be done with vector registers
easier.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
aesni_enc()
Calling convention for internal functions and 'asmlinkage' functions is
different on x86-32. Therefore do not directly cast aesni_enc as XTS tweak
function, but use wrapper function in between. Fixes crash with "XTS +
aesni_intel + x86-32" combination.
Cc: stable@vger.kernel.org
Reported-by: Krzysztof Kolasa <kkolasa@winsoft.pl>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patch adds the crc_pcl function that calculates CRC32C checksum using the
PCLMULQDQ instruction on processors that support this feature. This will
provide speedup over using CRC32 instruction only.
The usage of PCLMULQDQ necessitate the invocation of kernel_fpu_begin and
kernel_fpu_end and incur some overhead. So the new crc_pcl function is only
invoked for buffer size of 512 bytes or more. Larger sized
buffers will expect to see greater speedup. This feature is best used coupled
with eager_fpu which reduces the kernel_fpu_begin/end overhead. For
buffer size of 1K the speedup is around 1.6x and for buffer size greater than
4K, the speedup is around 3x compared to original implementation in crc32c-intel
module. Test was performed on Sandy Bridge based platform with constant frequency
set for cpu.
A white paper detailing the algorithm can be found here:
http://download.intel.com/design/intarch/papers/323405.pdf
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
This patch renames the crc32c-intel.c file to crc32c-intel_glue.c file
in preparation for linking with the new crc32c-pcl-intel-asm.S file,
which contains optimized crc32c calculation based on PCLMULQDQ
instruction.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|