commit 8535b3e11d2297854991c4272932ce4974dda629 (HEAD -> master, tag: 0.8.1)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Mar 22 17:42:33 2021 -0500

    Version file update (0.8.1)

commit e56d9f2d94ed247696dda2cbf94d2ca05c7fc089 (origin/master, origin/HEAD)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Mar 22 17:40:50 2021 -0500

    ReleaseNotes.md update in advance of next version.

commit ca83f955d45814b7d84f53933cdb73323c0dea2c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Mar 22 17:21:21 2021 -0500

    CREDITS file update.

commit 57ef61f6cdb86957f67212aa59407f2f8e7f3d1a
Merge: bf1b578e e7a4a8ed
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 19 13:05:43 2021 -0500

    Merge branch 'master' of github.com:flame/blis

commit bf1b578ea32ea1c9dbf7cb3586969e8ae89aa5ef
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 19 13:03:17 2021 -0500

    Reduced KC on skx from 384 to 256.
    
    Details:
    - Reduced the KC cache blocksize for double real on the skx subconfig
      from 384 to 256. The maximum (extended) KC was also reduced
      accordingly from 480 to 320. Thanks to Tze Meng Low for suggesting
      this change.

commit e7a4a8edc940942357e8e4c4594383a29a962f93
Author: Nicholai Tukanov <nicholaitukanov@gmail.com>
Date:   Wed Mar 17 19:43:31 2021 -0500

    Fix calculation of new pb size (#487)
    
    Details:
    - Added missing parentheses to the i8 and i4 instantiations of the
      GENERIC_GEMM macro in sandbox/power10/generic_gemm.c.
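    
    Sketch:
    - A minimal illustration of why an unparenthesized macro argument can
      change a size calculation. EX_SCALE and both helper functions are
      hypothetical names; this is not the actual GENERIC_GEMM definition
      or its real argument expression.

```c
/* Hypothetical macro, NOT the real GENERIC_GEMM: it pastes its first
   argument into a product without parenthesizing it. */
#define EX_SCALE( new_pb, group ) ( new_pb * group )

/* Round pb up to a multiple of 4, with no parentheses at the call site.
   Expands to ( pb / 4 + ( pb % 4 > 0 ) * 4 ), so only the remainder flag
   is multiplied by 4. */
static int ex_padded_pb_bad( int pb )
{
    return EX_SCALE( pb / 4 + ( pb % 4 > 0 ), 4 );
}

/* Same instantiation with the argument parenthesized. Expands to
   ( ( pb / 4 + ( pb % 4 > 0 ) ) * 4 ), which is the intended rounding. */
static int ex_padded_pb_good( int pb )
{
    return EX_SCALE( ( pb / 4 + ( pb % 4 > 0 ) ), 4 );
}
```

      With pb = 10, the first helper yields 6 while the second yields 12.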

commit 4493cf516e01aba82642a43abe350943ba458fe2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Mar 15 13:12:49 2021 -0500

    Redefined BLIS_NUM_ARCHS to update automatically.
    
    Details:
    - Changed BLIS_NUM_ARCHS from a cpp macro definition to the last enum
      value in the arch_t enum. This means that it no longer needs to get
      updated manually whenever new subconfigurations are added to BLIS.
      Also removed the explicit initial index assignment of 0 from the
      first enum value, which was unnecessary due to how the C language
      standard mandates indexing of enum values. Thanks to Devin Matthews
      for originally submitting this as a PR in #446.
    - Updated docs/ConfigurationHowTo.md to reflect the aforementioned
      change.
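    
    Sketch:
    - The "last enum value as a count" idiom described above, shown on a
      made-up enum. The type example_arch_t, its members, and
      example_arch_name() are illustrative assumptions, not the contents
      of arch_t.

```c
/* Illustrative enum only; the real arch_t has different members. */
typedef enum
{
    EX_ARCH_SKX,      /* first enumerator is 0 without an explicit "= 0" */
    EX_ARCH_HASWELL,
    EX_ARCH_ZEN,
    EX_ARCH_GENERIC,

    EX_NUM_ARCHS      /* sentinel: always equals the number of entries above */
} example_arch_t;

/* Tables sized by the sentinel grow automatically with the enum. */
static const char* ex_arch_names[ EX_NUM_ARCHS ] =
{
    "skx", "haswell", "zen", "generic"
};

const char* example_arch_name( example_arch_t id )
{
    /* The cast makes the bounds check work whether the enum's underlying
       type is signed or unsigned. */
    return ( ( unsigned )id < EX_NUM_ARCHS ) ? ex_arch_names[ id ] : "unknown";
}
```

      Adding a new enumerator before EX_NUM_ARCHS updates the count with
      no further edits.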

commit a4b73de84cdffcbe5cf71969a0f7f0f8202b3510
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 12 17:12:27 2021 -0600

    Disabled _self() and _equal() in bli_pthread API.
    
    Details:
    - Disabled the _self() and _equal() extensions to the bli_pthread API
      introduced in d479654. These functions were disabled after I realized
      that they aren't actually needed yet. Thanks to Devin Matthews for
      helping me reason through the appropriate consumer code that will
      appear in BLIS (eventually) in a future commit. (Also, I could never
      get the Windows branch to link properly in clang builds in AppVeyor.
      See the comment I left in the code, and #485, for more info.)

commit f9d604679d8715bc3e79a8630268446889b51388
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Mar 11 16:57:55 2021 -0600

    Added _self() and _equal() to bli_pthread API.
    
    Details:
    - Expanded the bli_pthread API to include equivalents to pthread_self()
      and pthread_equal(). Implemented these two functions for all three cpp
      branches present within bli_pthread.c: systemless, Windows, and
      Linux/BSD.

commit fa9b3c8f6b3d5717f19832362104413e1a86dfb0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Mar 11 15:13:51 2021 -0600

    Shuffled code in Windows branch of bli_pthreads.c.
    
    Details:
    - Reordered the definitions in the cpp branch in bli_pthreads.c that
      defines the bli_pthreads API in terms of Windows API calls. Also added
      missing comments that mark sections of the API, which brings the code
      into harmony with other cpp branches (as well as bli_pthread.h).

commit 95d4f3934d806b3563f6648d57a4e381d747caf5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Mar 11 13:50:40 2021 -0600

    Moved cpp macro redef of strerror_r to bli_env.c.
    
    Details:
    - Relocated the _MSC_VER-guarded cpp macro re-definition of strerror_r
      (in terms of strerror_s) from bli_thread.h to bli_env.c. It was
      likely left behind in bli_thread.h in a previous commit, when code
      that now resides in bli_env.c was moved from bli_thread.c. (I couldn't
      find any other instance of strerror_r being used in BLIS, so I moved
      the #define directly to bli_env.c rather than place it in bli_env.h.)
      The code that uses strerror_r is currently disabled, though, so this
      commit should have no effect on BLIS.

commit 8a3066c315358d45d4f5b710c54594455f9e8fc6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 9 17:52:59 2021 -0600

    Relocated gemmsup_ref general stride handling.
    
    Details:
    - Moved the logic that checks for general stridedness in any of the
      matrix operands in a gemmsup problem. The logic previously resided
      near the top of bli_gemmsup_int(), which is the thread entry point
      for the parallel region of the current gemmsup implementation. The
      problem with this setup was that the code would attempt to reject
      problems with any general-strided operands by returning BLIS_FAILURE,
      and that return value was then being ignored by the l3_sup thread
      decorator, which unconditionally returns BLIS_SUCCESS. To solve this
      issue, rather than try to manage n return values, one from each of n
      threads, I simply moved the logic into bli_gemmsup_ref(). I didn't
      move it any higher (e.g. bli_gemmsup()) because I still want the
      logic to be part of the current gemmsup handler implementation. That
      is, perhaps someone else will create a different handler, and that
      author wants to handle general stride differently. (We don't want to
      force them into a particular way of handling general stride.)
    - Removed the general stride handling from bli_gemmtsup_int(), even
      though this function is inoperative for now.
    - This commit addresses issue #484. Thanks to RuQing Xu for reporting
      this issue.
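    
    Sketch:
    - A simplified model of the control flow described above: the operand
      check runs once in the handler, before the parallel region, so its
      failure code cannot be discarded by a decorator that always reports
      success. All ex_* names and types are hypothetical stand-ins, and
      the stride test assumes "general stride" means neither the row nor
      the column stride has unit magnitude.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef enum { EX_SUCCESS, EX_FAILURE } ex_err_t;

/* Hypothetical matrix descriptor holding only row and column strides. */
typedef struct { long rs; long cs; } ex_mat_t;

/* Counts how many times the per-thread entry point was invoked. */
static int ex_threads_run = 0;

/* Assumption: general stride means neither stride is +/-1. */
static bool ex_has_gen_stride( const ex_mat_t* x )
{
    return labs( x->rs ) != 1 && labs( x->cs ) != 1;
}

/* Stand-in for the per-thread entry point. */
static ex_err_t ex_thread_entry( void )
{
    ++ex_threads_run;
    return EX_SUCCESS;
}

/* Stand-in for a decorator that ignores per-thread return values. */
static ex_err_t ex_thread_decorator( int n_threads )
{
    for ( int t = 0; t < n_threads; ++t ) ( void )ex_thread_entry();
    return EX_SUCCESS;
}

/* Handler: reject general-strided operands before any thread starts. */
ex_err_t ex_gemmsup_handler( const ex_mat_t* a, const ex_mat_t* b,
                             const ex_mat_t* c, int n_threads )
{
    if ( ex_has_gen_stride( a ) ||
         ex_has_gen_stride( b ) ||
         ex_has_gen_stride( c ) ) return EX_FAILURE;

    return ex_thread_decorator( n_threads );
}
```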

commit 670bc7b60f6065893e8ec1bebd2fc9e5ba710dff
Author: Nicholai Tukanov <nicholaitukanov@gmail.com>
Date:   Fri Mar 5 13:53:43 2021 -0600

    Add low-precision POWER10 gemm kernels (#467)
    
    Details:
    - This commit adds a new BLIS sandbox that (1) provides implementations
      based on low-precision gemm kernels, and (2) extends the BLIS typed
      API for those new implementations. Currently, these new kernels can
      only be used for the POWER10 microarchitecture; however, they may
      provide a template for developing similar kernels for other
      microarchitectures (even those beyond POWER), as changes would likely
      be limited to select places in the microkernel and possibly the
      packing routines. The new low-precision operations that are now
      supported include: shgemm, sbgemm, i16gemm, i8gemm, i4gemm. For more
      information, refer to the POWER10.md document that is included in
      'sandbox/power10'.

commit b8dcc5bc75a746807d6f8fa22dc2123c98396bf5 (origin/dev, origin/amd, dev, amd)
Author: RuQing Xu <r-xu@g.ecc.u-tokyo.ac.jp>
Date:   Tue Mar 2 06:58:24 2021 +0800

    Fixed typed API definition for gemmt (#476)
    
    Details:
    - Fixed incorrect definition and prototype of bli_?gemmt() in
      frame/3/bli_l3_tapi.c and .h, respectively. gemmt was previously
      defined identically to gemm, which was wrong because it did not
      take into account the uplo property of C.
    - Fixed incorrect API documentation for her2k/syr2k in BLISTypedAPI.md.
      Specifically, the document erroneously listed only a single transab
      parameter instead of transa and transb.

commit a0e4fe2340a93521e1b1a835a96d0f26dec8406a
Author: Ilknur <ilknuri607@gmail.com>
Date:   Tue Mar 2 02:06:56 2021 +0400

    Fixed double free() in level1v example (#482)
    
    Details:
    - In examples/tapi/00level1v.c, pointer 'z' was being freed twice and
      pointer 'a' was not being freed at all. This commit correctly frees
      each pointer exactly once.
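    
    Sketch:
    - The corrected pattern in miniature: every allocation is released
      exactly once. The function name and the pointer names other than
      'z' and 'a' are assumptions and are not taken from the example
      file.

```c
#include <stdlib.h>

/* Allocates three vectors, uses them, and frees each one exactly once.
   Returns 1 if all allocations succeeded, 0 otherwise. */
int ex_alloc_use_free( size_t n )
{
    double* x = malloc( n * sizeof( *x ) );
    double* z = malloc( n * sizeof( *z ) );
    double* a = malloc( n * sizeof( *a ) );

    int ok = ( x != NULL && z != NULL && a != NULL );

    if ( ok )
    {
        for ( size_t i = 0; i < n; ++i )
        {
            x[ i ] = ( double )i;
            z[ i ] = 2.0 * x[ i ];
            a[ i ] = x[ i ] + z[ i ];
        }
    }

    /* One free() per pointer; free( NULL ) is a harmless no-op, so this
       is also safe when an allocation failed. */
    free( x );
    free( z );
    free( a );

    return ok;
}
```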

commit f5871c7e06a75799251d6b55a8a5fbfa1a92cf95
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Feb 28 17:03:57 2021 -0600

    Added complex asm packm kernels for 'haswell' set.
    
    Details:
    - Implemented assembly-based packm kernels for single- and double-
      precision complex domain (c and z) and housed them in the 'haswell'
      kernel set. This means c3xk, c8xk, z3xk, and z4xk are now all
      optimized.
    - Registered the aforementioned packm kernels in the haswell, zen,
      and zen2 subconfigs.
    - Minor modifications to the corresponding s and d packm kernels that
      were introduced in 426ad67.
    - Thanks to AMD, who originally contributed the double-precision real
      packm kernels (d6xk and d8xk), upon which these complex kernels are
      partially based.

commit 426ad679f55264e381eb57a372632b774320fb85
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Feb 27 18:39:56 2021 -0600

    Added assembly packm kernels for 'haswell' set.
    
    Details:
    - Implemented assembly-based packm kernels for single- and double-
      precision real domain (s and d) and housed them in the 'haswell'
      kernel set. This means s6xk, s16xk, d6xk, and d8xk are now all
      optimized.
    - Registered the aforementioned packm kernels in the haswell, zen,
      and zen2 subconfigs.
    - Thanks to AMD, who originally contributed the double-precision real
      packm kernels (d6xk and d8xk), which I have now tweaked and used to
      create comparable single-precision real kernels (s6xk and s16xk).

commit f50c1b7e5886d29efe134e1994d05af9949cd4b6
Merge: 8f39aea1 b3953b93
Author: Devin Matthews <damatthews@smu.edu>
Date:   Mon Feb 1 11:55:51 2021 -0600

    Merge pull request #473 from ajaypanyala/pkgconfig
    
    build: generate pkgconfig file

commit 8f39aea11f80a805b66cff4b4dc5e72727ea461d
Merge: f8db9fb3 2a815d5b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jan 30 17:59:56 2021 -0600

    Merge branch 'dev'

commit f8db9fb33b48844d6b47fdef699625bd9197745a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jan 28 08:04:52 2021 -0600

    Fixed missing parentheses in README.md Citations.

commit b3953b938eee59f79b4a4162ba583a5cb59fa34e
Author: Ajay Panyala <ajay.panyala@gmail.com>
Date:   Tue Jan 12 17:07:04 2021 -0800

    drop CFLAGS in the generated pkgconfig file

commit b02d9376bac31c1a1c7916f44c4946277a1425e2
Author: Ajay Panyala <ajay.panyala@gmail.com>
Date:   Mon Jan 11 20:50:01 2021 -0800

    add datadir

commit d8d8deeb6d8b84adb7ae5fdb88c6dd4f06624a76
Author: Ajay Panyala <ajay.panyala@gmail.com>
Date:   Mon Jan 11 17:47:50 2021 -0800

    generate pkgconfig file

commit 8c65411c7c8737248a6f054ffa0ce008c95cb515
Merge: 328b4f88 874c3f04
Author: Devin Matthews <damatthews@smu.edu>
Date:   Mon Jan 11 16:01:45 2021 -0600

    Merge pull request #471 from flame/fix-470
    
    Fix kernel-to-config mapping for intel64

commit 874c3f04ece9af4d8fdf0e2713e21a259c117656
Author: Devin Matthews <damatthews@smu.edu>
Date:   Fri Jan 8 13:56:30 2021 -0600

    Update configure
    
    Choose last sub-config in the kernel-to-config map if the config list doesn't contain the name of the kernel set. E.g. for "zen: skx knl haswell" pick "haswell" instead of "skx" which was chosen previously. Fixes #470.
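    
    Sketch:
    - The selection rule above, transcribed into C for illustration only
      (configure itself is a shell script). ex_choose_subconfig() is a
      hypothetical helper: it returns the list entry matching the kernel
      set name if there is one, and otherwise the last entry in the list.

```c
#include <stdio.h>
#include <string.h>

/* Returns a pointer to a static buffer (not thread-safe) holding the
   chosen sub-config name from a space-separated list. */
const char* ex_choose_subconfig( const char* kernel_set, const char* configs )
{
    static char chosen[ 64 ];
    char        buf[ 256 ];

    /* strtok() modifies its input, so tokenize a bounded copy. */
    snprintf( buf, sizeof( buf ), "%s", configs );
    chosen[ 0 ] = '\0';

    for ( char* tok = strtok( buf, " " ); tok != NULL; tok = strtok( NULL, " " ) )
    {
        /* Record every entry, so the last one wins by default. */
        snprintf( chosen, sizeof( chosen ), "%s", tok );

        /* An entry named after the kernel set takes priority. */
        if ( strcmp( tok, kernel_set ) == 0 ) break;
    }

    return chosen;
}
```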

commit 2a815d5b365d934cb351b2f2a8cd1366e997b2e1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jan 4 18:03:39 2021 -0600

    Support trsm pre-inversion in 1m, bb, ref kernels.
    
    Details:
    - Expanded support for disabling trsm diagonal pre-inversion to other
      microkernel types, including the reference microkernel as well as the
      kernel implementations for 1m and the pre-broadcast B (bb) format used
      by the power9 subconfig. This builds on the 'haswell' and 'penryn'
      kernel support added in 7038bba. Thanks to Bhaskar Nallani for
      reminding me, in #461 (post-closure), that 1m support was missing from
      that commit.
    - Removed cpp branch of ref_kernels/3/bli_trsm_ref.c that contained the
      omp simd implementation after making a stripped-down copy in 'old'.
      This code has been disabled for some time and it seemed better suited
      to rot away out of sight rather than clutter up a file that is already
      cluttered by the presence of lower and upper versions.
    - Minor comment update to bli_ind_init().

commit c3ed2cbb9f60100fc9beb2a9d75476de9f711dc5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jan 4 16:16:32 2021 -0600

    Enable 1m only if real domain ukr is not reference.
    
    Details:
    - Previously, BLIS would automatically enable use of the 1m method
      for a given precision if the complex domain microkernel was a
      reference kernel. This commit adds an additional constraint so that
      1m is only enabled if the corresponding real domain microkernel is
      NOT reference. That is, BLIS now forgoes use of 1m if both the real and
      complex domain kernels are reference implementations. Note that this
      does not prevent 1m from being enabled manually under those
      conditions; it only means that 1m will not be enabled automatically
      at initialization-time.
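    
    Sketch:
    - The decision rule above reduced to a predicate. The function and
      parameter names are hypothetical; the real check lives in the BLIS
      initialization code and queries the context instead of taking
      booleans.

```c
#include <stdbool.h>

/* Returns true when 1m should be enabled automatically for a precision:
   the complex-domain microkernel is a reference kernel AND the
   corresponding real-domain microkernel is an optimized (non-reference)
   one. Manual enabling is a separate path and is not modeled here. */
bool ex_auto_enable_1m( bool complex_ukr_is_ref, bool real_ukr_is_ref )
{
    return complex_ukr_is_ref && !real_ukr_is_ref;
}
```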

commit ed50c947385ba3b0b5d550015f38f7f0a31755c0
Merge: 0cef09aa 328b4f88
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jan 4 14:31:44 2021 -0600

    Merge branch 'master' into dev

commit 328b4f8872b4bca9a53d2de8c6e285f3eb13d196
Author: Devin Matthews <damatthews@smu.edu>
Date:   Wed Dec 30 17:54:18 2020 -0600

    Shared object (dylib) was not built correctly for partial build.
    
    The SO build rule used $? instead of $^. Observed on macOS, not sure if it affected Linux or not.

commit ae6ef66ef824da9bc6348bf9d1b588cd4f2ded9b
Author: Devin Matthews <damatthews@smu.edu>
Date:   Wed Dec 30 17:34:55 2020 -0600

    bli_diag_offset_with_trans had wrong return type. Fixes #468.

commit ebcf197fb86fdd0a864ea928140752bc2462e8c6
Merge: 472f138c 21aa67e1
Author: Devin Matthews <damatthews@smu.edu>
Date:   Sat Dec 5 22:26:27 2020 -0600

    Merge pull request #466 from isuruf/patch-3
    
    fix cc_vendor for crosstool-ng toolchains

commit 21aa67e11cebbc5a6dd7c6353154256294df3c33
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Sat Dec 5 21:59:13 2020 -0600

    fix cc_vendor for crosstool-ng toolchains

commit 472f138cb927b7259126ebb9c68919cfcc7a4ea3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Dec 5 14:13:52 2020 -0600

    Fixed typo in README.md to CodingConventions.md.

commit 0cef09aa92208441a656bf097f197ea8e22b533b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Dec 4 16:40:59 2020 -0600

    Consolidated code in level-3 _front() functions.
    
    Details:
    - Reduced a code segment that appears in all of the bli_*_front()
      functions except for bli_gemm_front(). Previously, the code looked
      like this (taken from bli_herk_front()):
    
        if ( bli_cntx_method( cntx ) == BLIS_NAT )
        {
            bli_obj_set_pack_schema( BLIS_PACKED_ROW_PANELS, &a_local );
            bli_obj_set_pack_schema( BLIS_PACKED_COL_PANELS, &ah_local );
        }
        else // if ( bli_cntx_method( cntx ) != BLIS_NAT )
        {
            pack_t schema_a = bli_cntx_schema_a_block( cntx );
            pack_t schema_b = bli_cntx_schema_b_panel( cntx );
    
            bli_obj_set_pack_schema( schema_a, &a_local );
            bli_obj_set_pack_schema( schema_b, &ah_local );
        }
    
      This code segment is part of a sort-of-hack that allows us to
      communicate the pack schemas into the level-3 thread decorator, which
      needs them so that they can be passed into bli_l3_cntl_create_if(),
      where the control tree is created. However, the first conditional case
      above is unnecessary because the second case is fully generalized.
      That is, even in the native case, the context contains correct,
      queryable schemas. Thus, these code segments were reduced to something
      like:
    
        pack_t schema_a = bli_cntx_schema_a_block( cntx );
        pack_t schema_b = bli_cntx_schema_b_panel( cntx );
    
        bli_obj_set_pack_schema( schema_a, &a_local );
        bli_obj_set_pack_schema( schema_b, &ah_local );
    
      There's always a small chance that the seemingly unnecessary code
      in the first branch case has some special use that is not apparent to
      me, but the testsuite's default input parameters seem to think this
      commit will be fine.

commit 7038bbaa05484141195822291cf3ba88cbce4980
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Dec 4 16:08:15 2020 -0600

    Optionally disable trsm diagonal pre-inversion.
    
    Details:
    - Implemented a configure-time option, --disable-trsm-preinversion, that
      optionally disables the pre-inversion of diagonal elements of the
      triangular matrix in the trsm operation and instead uses division
      instructions within the gemmtrsm microkernels. Pre-inversion is
      enabled by default. When it is disabled, performance may suffer
      slightly, but numerical robustness should improve for certain
      pathological cases involving denormal (subnormal) numbers that would
      otherwise result in overflow in the pre-inverted value. Thanks to
      Bhaskar Nallani for reporting this issue via #461.
    - Added preprocessor macro guards to bli_trsm_cntl.c as well as the
      gemmtrsm microkernels for 'haswell' and 'penryn' kernel sets pursuant
      to the aforementioned feature.
    - Added macros to frame/include/bli_x86_asm_macros.h related to division
      instructions.

commit 78aee79452cce2691c40f05b3632bdfc122300af
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Dec 2 13:02:36 2020 -0600

    Allow amaxv testsuite module to run with dim = 0.
    
    Details:
    - Exit early from libblis_test_amaxv_check() when the vector dimension
      (length) of x is 0. This allows the module to run when the testsuite
      driver passes in a problem size of 0. Thanks to Meghana Vankadari for
      alerting us to this issue via #459.
    - Note: All other testsuite modules appear to work with problem sizes
      of 0, except for the microkernel modules. I chose not to "fix" those
      modules because a failure (or segmentation fault, as happens in this
      case) is actually meaningful in that it alerts the developer that some
      microkernels cannot be used with k = 0. Specifically, the 'haswell'
      kernel set contains microkernels that preload elements of B. Those
      microkernels would need to be restructured to avoid preloading in
      order to support usage when k = 0.

commit 92d2b12a44ee0990c22735472aeaf1c17deb2d9b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Dec 2 13:02:00 2020 -0600

    Fixed obscure testsuite gemmt dependency bug.
    
    Details:
    - Fixed a bug in the gemmt testsuite module that only manifested when
      testing of gemmt is enabled but testing of gemv is disabled. The bug
      was due to a copy-paste error dating back to the introduction of gemmt
      in 88ad841.

commit b43dae9a5d2f078c9bbe07079031d6c00a68b7de
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 1 16:44:38 2020 -0600

    Fixed copy-paste bugs in edge-case sup kernels.
    
    Details:
    - Fixed bugs in two sup kernels, bli_dgemmsup_rv_haswell_asm_1x6() and
      bli_dgemmsup_rd_haswell_asm_1x4(), which involved extraneous assembly
      instructions that were left over from when the kernels were first
      written. These instructions would cause segmentation faults in some
      situations where extra memory was not allocated beyond the end of
      the matrix buffers. Thanks to Kiran Varaganti for reporting these
      bugs and to Bhaskar Nallani for identifying the cause and solution.

commit 11dfc176a3c422729f453f6c23204cf023e9954d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 1 19:51:27 2020 +0000

    Reorganized thread auto-factorization logic.
    
    Details:
    - Reorganized logic of bli_thread_partition_2x2() so that the primary
      guts were factored out into "fast" and "slow" variants. Then added
      logic to the "fast" variant that allows for more optimal thread
      factorizations in some situations where there is at least one factor
      of 2.
    - Changed BLIS_THREAD_RATIO_M from 2 to 1 in bli_kernel_macro_defs.h and
      added comments to that file describing BLIS_THREAD_RATIO_? and
      BLIS_THREAD_MAX_?R.
    - In bli_family_zen.h and bli_family_zen2.h, preprocessed out several
      macros not used in vanilla BLIS and removed the unused macro
      BLIS_ENABLE_ZEN_BLOCK_SIZES from the former file.
    - Disabled AMD's small matrix handling entry points in bli_syrk_front.c
      and bli_trsm_front.c. (These branches of small matrix handling have
      not been reviewed by vanilla BLIS developers.)
    - Added commented-out calls to printf() in bli_rntm.c.
    - Whitespace changes to bli_thread.c.
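    
    Sketch:
    - A brute-force illustration of what a two-way thread factorization
      tries to achieve: split nt threads into tm x tn so that tm/tn
      tracks the m/n shape of the problem. This is NOT the algorithm in
      bli_thread_partition_2x2() or either of its variants; the type and
      function below are hypothetical and ignore BLIS_THREAD_RATIO_? and
      BLIS_THREAD_MAX_?R.

```c
typedef struct { long tm; long tn; } ex_factors_t;

/* Try every divisor pair (tm, tn) with tm * tn == nt and keep the one
   whose ratio tm/tn is closest to m/n, measured as |tm*n - tn*m|.
   Ties keep the pair with the smaller tm. */
ex_factors_t ex_partition_2x2( long nt, long m, long n )
{
    ex_factors_t best     = { 1, nt };
    long         best_err = -1;

    for ( long tm = 1; tm <= nt; ++tm )
    {
        if ( nt % tm != 0 ) continue;

        long tn  = nt / tm;
        long err = tm * n - tn * m;   /* zero when tm/tn == m/n */
        if ( err < 0 ) err = -err;

        if ( best_err < 0 || err < best_err )
        {
            best.tm  = tm;
            best.tn  = tn;
            best_err = err;
        }
    }

    return best;
}
```

      For a square problem, 12 threads factor as 3 x 4, whereas a prime
      count such as 7 can only factor as 1 x 7.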

commit 6d3bafacd7aa7ad198762b39490876c172bfbbcb
Author: Devin Matthews <damatthews@smu.edu>
Date:   Sat Nov 28 17:17:56 2020 -0600

    Update BuildSystem.md
    
    Add git version >= 1.8.5 requirement (see #462).

commit 64856ea5a61b01d585750815788b6a775f729647
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 23 16:54:51 2020 -0600

    Auto-reduce (by default) prime numbers of threads.
    
    Details:
    - When requesting multithreaded parallelism by specifying the total
      number of threads (whether it be via environment variable, globally at
      runtime, or locally at runtime), reduce the number of threads actually
      used by one if the original value (a) is prime and (b) exceeds a
      minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which is set
      to 11 by default. If, when specifying the total number of threads (and
      not the individual ways of parallelism for each loop), prime numbers
      of threads are desired, this feature may be overridden by defining the
      BLIS_ENABLE_AUTO_PRIME_NUM_THREADS macro in the bli_family_*.h that
      corresponds to the configuration family targeted at configure-time.
      (For now, there are no configure options to control this feature.)
      Thanks to Jeff Diamond for suggesting this change.
    - Defined a new function in bli_thread.c, bli_is_prime(), that returns a
      bool indicating whether an integer is prime. This function is
      implemented in terms of existing functions in bli_thread.c.
    - Updated docs/Multithreading.md to document the above feature, along
      with unrelated minor edits.

commit 55933b6ff6b9b8a12041715f42bba06273d84b74
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Nov 20 10:39:32 2020 -0600

    Added missing attribution to docs/ReleaseNotes.md.

commit e310f57b4b29fbfee479e0f9fe2040851efdec4f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Nov 19 13:33:37 2020 -0600

    CHANGELOG update (0.8.0)

commit 9b387f6d5a010969727ec583c0cdd067a5274ed8 (tag: 0.8.0)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Nov 19 13:33:37 2020 -0600

    Version file update (0.8.0)

commit 2928ec750d3a3e1e5d55de5b57ddc04e9d0bd796
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 18 18:31:35 2020 -0600

    ReleaseNotes.md update in advance of next version.
    
    Details:
    - Updated docs/ReleaseNotes.md in preparation for next version.

commit b9899bedff6854639468daa7a973bb14ca131a74
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 18 16:52:41 2020 -0600

    CREDITS file update.

commit 9bb23e6c2a44b77292a72093938ab1ee6e6cc26a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 16 15:55:45 2020 -0600

    Added support for systemless build (no pthreads).
    
    Details:
    - Added a configure option, --[enable|disable]-system, which determines
      whether the modest operating system dependencies in BLIS are included.
      The most notable example of this on Linux and BSD/OSX is the use of
      POSIX threads to ensure thread safety for when application-level
      threads call BLIS. When --disable-system is given, the bli_pthreads
      implementation is dummied out entirely, allowing the calling code
      within BLIS to remain unchanged. Why would anyone want to build BLIS
      like this? The motivating example was submitted via #454 in which a
      user wanted to build BLIS for a simulator such as gem5 where thread
      safety may not be a concern (and where the operating system is largely
      absent anyway). Thanks to Stepan Nassyr for suggesting this feature.
    - Another, more minor side effect of the --disable-system option is that
      the implementation of bli_clock() unconditionally returns 0.0 instead
      of the time elapsed since some fixed point in the past. The reasoning
      for this is that if the operating system is truly minimal, the system
      function call upon which bli_clock() would normally be implemented
      (e.g. clock_gettime()) may not be available.
    - Refactored preprocess-guarded code in bli_pthread.c and bli_pthread.h
      to remove redundancies.
    - Removed old comments and commented #include of "bli_pthread_wrap.h"
      from bli_system.h.
    - Documented bli_clock() and bli_clock_min_diff() in BLISObjectAPI.md
      and BLISTypedAPI.md, with a note that both are non-functional when
      BLIS is configured with --disable-system.
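    
    Sketch:
    - One way a clock function can be dummied out when the operating
      system is unavailable. EX_DISABLE_SYSTEM and ex_clock() are
      hypothetical names; the guard macro that BLIS actually uses is not
      stated in this commit message.

```c
/* Request clock_gettime() even under strict -std= modes. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <time.h>

/* Returns seconds elapsed since an arbitrary fixed point, or 0.0
   unconditionally when built without operating system support. */
double ex_clock( void )
{
#ifdef EX_DISABLE_SYSTEM
    /* No system calls available: callers still link and run, but all
       timings read as zero. */
    return 0.0;
#else
    struct timespec ts;

    clock_gettime( CLOCK_MONOTONIC, &ts );

    return ( double )ts.tv_sec + ( double )ts.tv_nsec * 1.0e-9;
#endif
}
```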

commit 88ad84143414644df4c56733b1cf91a36bfacaf8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Nov 14 09:39:48 2020 -0600

    Squash-merge 'pr' into 'squash'. (#457)
    
    Merged contributions from AMD's AOCL BLIS (#448).
    
    Details:
    - Added support for level-3 operation gemmt, which performs a gemm on
      only the lower or upper triangle of a square matrix C. For now, only
      the conventional/large code path will be supported (in vanilla BLIS).
      This was accomplished by leveraging the existing variant logic for
      herk. However, some of the infrastructure to support a gemmtsup is
      included in this commit, including
      - A bli_gemmtsup() front-end, similar to bli_gemmsup().
      - A bli_gemmtsup_ref() reference handler function.
      - A bli_gemmtsup_int() variant chooser function (with variant calls
        commented out).
    - Added support for inducing complex domain gemmt via the 1m method.
    - Added gemmt APIs to the BLAS and CBLAS compatibility layers.
    - Added gemmt test module to testsuite.
    - Added standalone gemmt test driver to 'test' directory.
    - Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md.
    - Added a C++ template header (blis.hh) containing a BLAS-inspired
      wrapper to a set of polymorphic CBLAS-like function wrappers defined
      in another header (cblas.hh). These two headers are installed if
      running the 'install' target with INSTALL_HH set to 'yes'. (Also
      added a set of unit tests that exercise blis.hh, although they are
      disabled for now because they aren't compatible with out-of-tree
      builds.) These files now live in the 'vendor' top-level directory.
    - Various updates to 'zen' and 'zen2' subconfigurations, particularly
      within the context initialization functions.
    - Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and
      various minor updates to dotv and scalv kernels. Also added various
      sup kernels contributed by AMD to kernels/zen/3. However, these
      kernels are (for now) not yet used, in part because they caused
      AppVeyor clang failures, and also because I have not found time to
      review and vet them.
    - Output the python found during configure into the definition of PYTHON
      in build/config.mk (via build/config.mk.in).
    - Added early-return checks (A, B, or C with zero dimension; alpha = 0)
      to bli_gemm_front.c.
    - Implemented explicit beta = 0 handling for the sgemm ukernel in
      bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent
      bug surfaced because the gemmt module verifies its computation using
      gemm with its beta parameter set to zero, which, on a cortexa15 system
      caused the gemm kernel code to unconditionally multiply the
      uninitialized C data by beta. The C matrix likely contained
      non-numeric values such as NaN, which then would have resulted in a
      false failure.
    - Fixed a bug whereby the implementation for bli_herk_determine_kc(),
      in bli_l3_blocksize.c, was inadvertently being defined in terms of
      helper functions meant for trmm. This bug was probably harmless since
      the trmm code should have also done the right thing for herk.
    - Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in
      kernels/zen/3/bli_gemm_small.c since those macros are not used in
      vanilla BLIS.
    - Added cpp guard to definition of bli_mem_clear() in bli_mem.h to
      accommodate C++'s stricter type checking.
    - Added cpp guard to test/*.c drivers that facilitates compilation on
      Windows systems.
    - Various whitespace changes.

commit 234b8b0cf48f1ee965bd7999b291fc7add3b9a54
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Nov 12 19:11:16 2020 -0600

    Increased dotxaxpyf testsuite thresholds.
    
    Details:
    - Increased the test thresholds used by the dotxaxpyf testsuite module
      by a factor of five in order to avoid residuals that unnecessarily
      fall in the MARGINAL range. This commit should fix #455. Thanks to
      @nagsingh for reporting this issue.

commit ed612dd82c50063cfd23576a6b2465213d31b14b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Nov 7 13:09:42 2020 -0600

    Updated README.md with sgemmsup blurb.
    
    Details:
    - Added an entry to the "What's New" section of the README.md to
      announce the availability of sgemmsup.

commit e14424f55b15d67e8d18384aea45a11b9b772e02
Merge: 0cfe1aac eccdd75a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Nov 7 13:02:50 2020 -0600

    Merge branch 'dev'

commit 0cfe1aac222008a78dff3ee03ef5183413936706
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 30 17:10:36 2020 -0500

    Relocated operation index to ToC in API docs.
    
    Details:
    - Moved the "Operation index" section of both the BLISObjectAPI.md and
      BLISTypedAPI.md docs to appear immediately after the table of contents
      of each document. This allows the reader to quickly jump to the
      documentation for any operation without having to scroll through much
      of the document (when rendered via a web browser).
    - Fixed a mistake in the BLISObjectAPI.md for the setd operation, which
      does *not* observe the diag property of its matrix argument. Thanks to
      Jeff Diamond for reporting this.

commit 2a0682f8e5998be536da313525292f0da6193147
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Oct 18 18:04:03 2020 -0500

    Implemented runtime subconfig selection (#451).
    
    Details:
    - Implemented support for the user manually overriding the automatic
      subconfiguration selection that happens at runtime. This override
      can be requested by setting the BLIS_ARCH_TYPE environment variable.
      The variable must be set to the arch_t id (as enumerated in
      bli_type_defs.h) corresponding to the desired subconfiguration. If a
      value outside this enumerated range is given, BLIS will abort with an
      error message. If the value is in the valid range but corresponds to a
      subconfiguration that was not activated at configure-time/compile-time,
      BLIS will abort with a (different) error message. Thanks to decandia50
      for suggesting this feature via issue #451.
    - Defined a new function bli_gks_lookup_id to return the address of an
      internal data structure within the gks. If this address is NULL, then
      it indicates that the subconfig corresponding to the arch_t id passed
      into the function was not compiled into BLIS. This function is used
      in the second of the two abort scenarios described above.
    - Defined the enumerated error code BLIS_UNINITIALIZED_GKS_CNTX, which
      is returned for the latter of the two abort scenarios mentioned above,
      along with a corresponding error message and a function to perform
      the error check.
    - Added cpp macro branching to bli_env.c to support compilation of the
      auto-detect.x executable during configure-time. This cpp branch is
      similar to the cpp code already found in bli_arch.c and bli_cpuid.c.
    - Cleaned up the auto_detect() function to facilitate easier maintenance
      going forward. Also added a convenient debug switch that outputs the
      compilation command for the auto-detect.x executable and exits.
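
    Illustrative sketch (not the BLIS source): the two abort scenarios
    described above could be told apart roughly as follows. The function
    arch_override_status(), its result codes, and the 'enabled' table are
    hypothetical names invented for this example; in BLIS the second check
    is performed via bli_gks_lookup_id and the arch_t enumeration.

    ```c
    #include <stdlib.h>
    #include <stddef.h>

    /* Hypothetical result codes for this sketch (not BLIS error codes). */
    enum { ARCH_OK = 0, ARCH_NOT_SET, ARCH_OUT_OF_RANGE, ARCH_NOT_COMPILED };

    /* Classify an override request.
       value:     string read from the environment (NULL if unset)
       num_archs: number of ids in the enumeration (assumed)
       enabled:   enabled[i] != 0 if subconfig i was compiled in (assumed)
       id_out:    receives the parsed id when ARCH_OK is returned */
    static int arch_override_status( const char* value, long num_archs,
                                     const int* enabled, long* id_out )
    {
        if ( value == NULL ) return ARCH_NOT_SET;

        char* end = NULL;
        long  id  = strtol( value, &end, 10 );

        /* Treat empty or non-numeric strings as out of range. */
        if ( end == value || *end != '\0' ) return ARCH_OUT_OF_RANGE;
        if ( id < 0 || id >= num_archs )    return ARCH_OUT_OF_RANGE;

        /* In range, but the subconfig was not built into the library. */
        if ( !enabled[ id ] ) return ARCH_NOT_COMPILED;

        *id_out = id;
        return ARCH_OK;
    }
    ```

    A caller would typically pass getenv( "BLIS_ARCH_TYPE" ) as 'value' and
    abort with a distinct message for each of the two failure codes.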

commit eccdd75a2d8a0c46e91e94036179c49aa5fa601c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 9 15:44:16 2020 -0500

    Whitespace tweak in docs/PerformanceSmall.md.

commit 7677e9ba60ac27496e3421c2acc7c239e3f860e9
Merge: addcd46b a0849d39
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 9 15:41:25 2020 -0500

    Merge branch 'dev' of github.com:flame/blis into dev

commit addcd46b0559d401aa7d33d4c7e6f63f5313a8e0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 9 15:41:09 2020 -0500

    Added Epyc 7742 Zen2 ("Rome") sup perf results.
    
    Details:
    - Added single-threaded and multithreaded sup performance results to
      docs/PerformanceSmall.md for both sgemm and dgemm. These results were
      gathered on an Epyc 7742 "Rome" server featuring AMD's Zen2
      microarchitecture. Special thanks to Jeff Diamond for facilitating
      access to the system via the Oracle Cloud.
    - Updates to octave scripts in test/sup/octave for use with Octave 5.2
      and for use with subplot_tight().
    - Minor updates to octave scripts in test/3/octave.
    - Renamed files containing the previous Zen performance results for
      consistency with the new results.
    - Decreased line thickness slightly in large/conventional Zen2 graphs.
      I'm done tweaking those this time. Really.
    - Added missing line regarding eigen header installation for each
      microarchitecture section.

commit a0849d390d04067b82af937cda8191b049b98915
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 9 20:22:17 2020 +0000

    Register l3 sup kernels in zen2 subconfig.
    
    Details:
    - Registered full suite of sgemm and dgemm sup millikernels, blocksizes,
      and crossover thresholds in bli_cntx_init_zen2.c.
    - Minor updates to test/sup/runme.sh for running on Zen2 Epyc 7742
      system.

commit d98368c32d5fbfaab8966ee331d9bcb5c4fe7a59
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 8 19:05:51 2020 -0500

    Another tweak to line thickness of Zen2 graphs.

commit 1855dfbdaafa37892b36c97fd317fd5d8da76676
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 8 19:01:00 2020 -0500

    Tweaked line thickness in Zen2 graphs once more.
    
    Details:
    - Decreased (relative to previous commit) line thickness in recent Zen2
      graphs.

commit 0991611e7ed82889c53a5c3f1ef1d49552c50d61
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 8 18:54:49 2020 -0500

    Increased line thickness in recent Zen2 graphs.
    
    Details:
    - Increased the width of the lines in the graphs introduced in 74ec6b8.

commit 8273cbacd7799e9af59e5320d66055f2f5d9cb31
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 7 14:51:33 2020 -0500

    README.md, docs/FAQ.md updates.
    
    Details:
    - Added a frequently asked question to docs/FAQ.md regarding the
      difference between upstream (vanilla) BLIS and AMD BLIS.
    - Updated the name of ICES in the README.md to reflect the Oden
      rebranding.

commit a178a822ad3d5021489a0e61f909d8550ae12a8f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Sep 30 16:00:52 2020 -0500

    Added Zen2 links to docs/Performance.md Contents.

commit 74ec6b8f457cabe37d2382aaab35ba04fc737948
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Sep 30 15:54:18 2020 -0500

    Added Epyc 7742 Zen2 ("Rome") performance results.
    
    Details:
    - Added single-threaded and multithreaded performance results to
      docs/Performance.md. These results were gathered on an Epyc 7742
      "Rome" server with AMD's Zen2 microarchitecture. Special thanks
      to Jeff Diamond for facilitating access to the system via the
      Oracle Cloud.
    - Renamed files containing the previous Zen performance results for
      consistency with the new results.

commit bc4a213a2c3dcf8bbfcbb3a1ef3e9fc9e3226c34
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Sep 30 15:28:20 2020 -0500

    Updated matlab (now octave) plot code in test/3.
    
    Details:
    - Renamed test/3/matlab to test/3/octave.
    - Within test/3, updated and tuned plot_l3_perf.m and plot_panel_4x5.m
      files for use with octave (which is free and doesn't crash on me
      mid-way through my use of subplot).
    - Updated runthese.m scratchpad for zen2 invocations.
    - Added Nikolay S.'s subplot_tight() function, along with its license.

commit c77ddc418187e1884fa6bcfe570eee295b9cb8bc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Sep 30 20:15:43 2020 +0000

    Added optional numactl usage to test/3/runme.sh.

commit 2d8ec164e7ae4f0c461c27309dc1f5d1966eb003
Author: Nicholai Tukanov <nicholai@utexas.edu>
Date:   Tue Sep 29 16:52:18 2020 -0500

    Add POWER10 support to BLIS (#450)

commit 4fd8d9fec2052257bf2a5c6e0d48ae619ff6c3e4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 28 23:39:05 2020 +0000

    Tweaked zen2 subconfig's MC cache blocksizes.
    
    Details:
    - Updated the MC cache blocksizes registered by the 'zen2' subconfig.
    - Minor updates to test/3/Makefile and test/3/runme.sh.

commit 5efcdeffd58af621476d179afc0c19c0f912baa8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Sep 25 14:25:24 2020 -0500

    More minor README.md updates.

commit 9e940f8aad6f065ea1689e791b9a4e1fb7900c40
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Sep 25 13:53:35 2020 -0500

    Added 1m SISC bibtex to README.md.
    
    Details:
    - Added final citation info to 1m bibtex in README.md file.
    - Updated draft 1m paper link.
    - Changed some http to https.

commit e293cae2d1b9067261f613f25eaa0e871356b317
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 15 16:09:11 2020 -0500

    Implemented sgemmsup assembly kernels.
    
    Details:
    - Created a set of single-precision real millikernels and microkernels
      comparable to the dgemmsup kernels that already exist within BLIS.
    - Added prototypes for all kernels within bli_kernels_haswell.h.
    - Registered entry-point millikernels in bli_cntx_init_haswell.c and
      bli_cntx_init_zen.c.
    - Added sgemmsup support to the Makefile, runme.sh script, and source
      file in test/sup. This included edits that allow for separate "small"
      dimensions for single- and double-precision as well as for single-
      vs. multithreaded execution.

commit 2765c6f37c11cb7f71cd4b81c64cea6130636c68
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Sep 12 17:48:15 2020 -0500

    Type saga continues; fixed sgemm ukernel signature.
    
    Details:
    - Changed double* pointers in sgemm function signature to float*. At
      this point I've lost track of whether this was my fault or another
      dormant bug like the one described in ece9f6a, and I no longer
      care. It's one of those days (aka I didn't ask for this).

commit 0779559509e0a1af077530d09ed151dac54f32ee
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Sep 12 17:37:21 2020 -0500

    Fixed missing restrict in knl sgemm prototype.
    
    Details:
    - Added a missing 'restrict' qualifier in the sgemm ukernel prototype
      for knl. (Not sure how that code was ever compiling before now.)

commit ece9f6a3ef1b26b53ecf968cd069df7a85b139fb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Sep 12 17:22:42 2020 -0500

    Fixed dormant type bugs in bli_kernels_knl.h.
    
    Details:
    - Fixed dormant type mismatches in the use of the prototype-generating
      macros in bli_kernels_knl.h. Specifically, some float prototypes
      were incorrectly using double as their ctype. This didn't actually
      matter until the type changes in 645d771, as previously those types
      were not used since packm was prototyped with void* pointers.

commit 8ebb3b60e1c4c045ddb48e02de6e246cecde24a4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Sep 12 17:00:47 2020 -0500

    Fixed accidental breakage in 645d771.
    
    Details:
    - In trying to clean up kappa_cast variables in the reference packm
      kernels, which I initially believed to be redundant given the other
      void* -> ctype* changes in 645d771, I accidentally ended up violating
      restrict semantics for 1e/1r packing and possibly other packm kernels.
      (Normally, my pre-commit testsuite run would have caught this, but I
      was unknowingly using an edited input.operations file in which I'd
      disabled most tests as part of unrelated work.) This commit reverts
      the kappa_cast changes in 645d771.

commit 645d771a14ae89aa7131d6f8f4f4a8090329d05e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Sep 12 15:31:56 2020 -0500

    Minor packm kernel type cleanup (void* -> ctype*).
    
    Details:
    - Changed all void* function arguments in reference packm kernels to
      those of the native type (ctype*). These pointers no longer need to
      be void* and are better represented by their native types anyway.
      (See below for details.) Updated knl packm kernels accordingly.
    - In the definition of the PACKM_KER_PROT prototype macro template in
      frame/1m/bli_l1m_ker_prot.h, changed the pointer types for kappa, a,
      and p from void* to ctype*. They were originally void* because these
      function signatures had to share the same type so they could all be
      stored in a single array of that shared type, from which they were
      queried and called by packm_cxk(). This is no longer how the function
      pointers are stored, and so it no longer makes sense to force the
      caller of packm kernels to use void*, only so that the implementor
      of the packm kernels can typecast back to the native datatype within
      the kernel definition. This change has no effect internally within
      BLIS because currently all packm kernels are called after querying
      the function addresses from the context and then typecasting to the
      appropriate function pointer type, which is based upon type-specific
      function pointers like float* and double*.
    - Removed a comment in frame/1m/bli_l1m_ft_ker.h that was outdated and
      misleading due to changes to the handling of packm kernels since
      moving them into the context.

commit 54bf6c35542a297e25bc8efec6067a6df80536f4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Sep 10 15:42:01 2020 -0500

    Minor README.md update.
    
    Details:
    - Added a new entry to the "What people are saying about BLIS" section.

commit e50b4d40462714ae33df284655a2faf7fa35f37c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Sep 9 14:12:53 2020 -0500

    Minor update to README.md (SIAM Best Paper Prize).

commit a8efb72074691e2610372108becd88b4b392299e
Merge: b0c4da17 97e87f2c
Author: Devin Matthews <damatthews@smu.edu>
Date:   Mon Sep 7 16:18:19 2020 -0500

    Merge pull request #434 from flame/intel-zdot
    
    Add an option to change the complex return type.

commit 97e87f2c9f3878a05e1b7c6ec237ee88d9a72a42
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 7 15:56:42 2020 -0500

    Whitespace/comment updates to #434 PR.

commit b0c4da1732b6c6a9ff66f70c36e4722e0f9645ae
Merge: 810e90ee b1b5870d
Author: Devin Matthews <damatthews@smu.edu>
Date:   Mon Sep 7 15:47:54 2020 -0500

    Merge pull request #436 from flame/s390x
    
    Add checks so that s390x is detected as 64-bit.

commit 810e90ee806510c57504f0cf8eeaf608d38bd9dd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 1 16:11:40 2020 -0500

    Minor README.md update.
    
    Details:
    - Added HPE to list of funders.
    - Changed http to https in funders' website links.

commit 7d411282196e036991c26e52cb5e5f85769c8059
Author: Devin Matthews <damatthews@smu.edu>
Date:   Thu Aug 13 17:50:58 2020 -0500

    Use -O2 for all framework code. (#435)
    
    It seems that -O3 might be causing intermittent problems with the f2c'ed packed and banded code. -O3 is retained for kernel code. Fixes #341 and fixes #342.

commit 9c5b485d356367b0a1288761cd623f52036e7344
Author: Dave Love <dave.love@manchester.ac.uk>
Date:   Fri Aug 7 20:11:18 2020 +0000

    Don't override -mcpu with -march on ARM (#353)
    
    * Use -mcpu for ARM
    See the GCC doc about -march, -mtune, and -mcpu and maybe
    https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/compiler-flags-across-architectures-march-mtune-and-mcpu
    
    * Fix typo in flags
    
    * Fix typo in cortexa9 flags
    
    * Modify cortexa53 compilation flags to fix failing BLAS check (#341)

commit c253d14a72a746b670b3ffbb6e81bcafc73d1133
Author: Devin Matthews <damatthews@smu.edu>
Date:   Fri Aug 7 09:39:04 2020 -0500

    Also handle Intel-style complex return in CBLAS interface.

commit 5d653a11a0cc71305d0995507b1733995856f475
Author: Devin Matthews <damatthews@smu.edu>
Date:   Thu Aug 6 17:58:26 2020 -0500

    Update Multithreading.md
    
    Addresses the issue raised in #426.

commit b1b5870dd3f9b1c78cf5f58a53514d73f001fc4c
Author: Devin Matthews <damatthews@smu.edu>
Date:   Thu Aug 6 17:34:20 2020 -0500

    Add checks so that s390x is detected as 64-bit.

commit 882dcb11bfc9ea50aa2f9044621833efd90d42be
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 6 17:28:14 2020 -0500

    Mention example code at top of documentation docs.
    
    Details:
    - Steer the reader towards the example code section of each
      documentation doc (object and typed).
    - Trivial update to examples/oapi/README, examples/tapi/README.

commit f4894512e5bf56ff83701c07dd02972e300741a5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 6 17:20:00 2020 -0500

    Very minor updates to previous commit.

commit adedb893ae8dfacd1dc54035979e15c44d589dbb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 6 17:14:01 2020 -0500

    Documented mutator functions in BLISObjectAPI.md.
    
    Details:
    - Added documentation for commonly-used object mutator functions in
      BLISObjectAPI.md. Previously, only accessor functions were documented.
      Thanks to Jeff Diamond for pointing out this omission.
    - Explicitly set the 'diag' property of objects in oapi example modules
      (08level2.c and 09level3.c).

commit 5b5278ff494888509543a79c09ea82089f6c95d9
Author: Devin Matthews <damatthews@smu.edu>
Date:   Thu Aug 6 14:19:37 2020 -0500

    Use #ifdef instead of #if as macro may be undefined.

commit 7fdc0fc893d0c6727b725ea842053b65be2c20ba
Author: Devin Matthews <damatthews@smu.edu>
Date:   Thu Aug 6 14:03:55 2020 -0500

    Add an option to change the complex return type.
    
    ifort apparently does not return complex numbers in registers as in C/C++ (or gfortran), but instead creates a "hidden" first parameter for the return value. The option --complex-return=gnu|intel has been added, as well as a guess based on a provided FC if not specified (otherwise default to gnu). This option affects the signatures of cdotc, cdotu, zdotc, and zdotu, and a single library cannot be used with both GNU and Intel Fortran compilers. Fixes #433.
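
    Illustrative sketch (not the BLIS source): seen from C, the two return conventions differ roughly as shown below. The function names are hypothetical, and the sketch simplifies the real BLAS interface by assuming unit stride and passing n by value.

    ```c
    #include <complex.h>

    /* GNU-style convention: the complex result is returned by value. */
    static double complex zdotc_gnu_style( int n, const double complex* x,
                                           const double complex* y )
    {
        double complex sum = 0.0;

        /* zdotc conjugates the first vector: sum of conj(x[i]) * y[i]. */
        for ( int i = 0; i < n; ++i ) sum += conj( x[ i ] ) * y[ i ];

        return sum;
    }

    /* Intel-style convention: the result is written through a "hidden"
       first parameter and the function itself returns nothing. */
    static void zdotc_intel_style( double complex* result, int n,
                                   const double complex* x,
                                   const double complex* y )
    {
        *result = zdotc_gnu_style( n, x, y );
    }
    ```

    Because the two signatures are incompatible, a caller compiled for one convention cannot safely call a library built for the other.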

commit 6e522e5823b762d4be09b6acdca30faafba56758
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 30 19:31:37 2020 -0500

    Mention disabling of sup in docs/Sandboxes.md.
    
    Details:
    - Added language to remind the reader to disable sup if the intended
      behavior is for the sandbox implementation to handle all problem
      sizes, even the smaller ones that would normally be handled by the
      sup code path.

commit 00e14cb6d849e963a2e1ac35e7dbbe186af00a58
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 29 14:24:34 2020 -0500

    Replaced use of bool_t type with C99 bool.
    
    Details:
    - Textually replaced nearly all non-comment instances of bool_t with the
      C99 bool type. A few remaining instances, such as those in the files
      bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and
      bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being
      used not for boolean purposes but to index into an array.
    - This commit constitutes the third phase of a transition toward using
      C99's bool instead of bool_t, which was raised in issue #420. The first
      phase, which cleaned up various typecasts in preparation for using
      bool as the basis for bool_t (instead of gint_t), was implemented by
      commit a69a4d7. The second phase, which redefined the bool_t typedef
      in terms of bool (from gint_t), was implemented by commit 2c554c2.

commit 2c554c2fce885f965a425e727a0314d3ba66c06d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 24 15:57:19 2020 -0500

    Redefined bool_t typedef in terms of C99 bool.
    
    Details:
    - Changed the typedef that defines bool_t from:
    
        typedef gint_t bool_t;
    
      where gint_t is a signed integer that forms the basis of most other
      integers in BLIS, to:
    
        typedef bool bool_t;
    
    - Changed BLIS's TRUE and FALSE macro definitions from being in terms of
      integer literals:
    
        #define TRUE  1
        #define FALSE 0
    
      to being in terms of C99 boolean constants:
    
        #define TRUE  true
        #define FALSE false
    
      which are provided by stdbool.h.
    - This commit constitutes the second phase of a transition toward using
      C99's bool instead of bool_t, which will address issue #420. The first
      phase, which cleaned up various typecasts in preparation for using
      bool as the basis for bool_t (instead of gint_t), was implemented by
      commit a69a4d7.
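
    Illustrative sketch (general C99 behavior, not a bug fixed by this
    commit): converting a value to an integer-based boolean typedef
    truncates it, whereas converting to C99 bool yields 1 for any nonzero
    value. The typedef and function names below are hypothetical, and
    int64_t merely stands in for gint_t.

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    typedef int64_t int_bool_t;  /* stand-in for the old integer-based bool_t */
    typedef bool    c99_bool_t;  /* stand-in for the new bool-based bool_t    */

    /* Integer conversion truncates toward zero: 0.5 becomes 0. */
    static int_bool_t to_int_bool( double v ) { return ( int_bool_t )v; }

    /* Conversion to bool compares against zero: 0.5 becomes true. */
    static c99_bool_t to_c99_bool( double v ) { return ( c99_bool_t )v; }
    ```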

commit e01dd125581cec87f61e15590922de0dc938ec42
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 24 15:41:46 2020 -0500

    Fail-safe updates to Makefiles in 'test' dir.
    
    Details:
    - Updated Makefiles in test, test/3, and test/sup so that running any of
      the usual targets without having first built BLIS results in a helpful
      error message. For example, if BLIS is not yet configured, make will
      output:
    
        Makefile:327: *** Cannot proceed: config.mk not detected! Run
        configure first.  Stop.
    
      Similarly, if BLIS is configured but not yet built, make will output:
    
        Makefile:340: *** Cannot proceed: BLIS library not yet built! Run
        make first.  Stop.
    
      In previous commits, these actions would result in a rather cryptic
      make error such as:
    
        make: *** No rule to make target 'test_sgemm_2400_asm_blis_st.x',
        needed by 'blis-nat-st'.  Stop.

commit b4f47f7540062da3463e2cb91083c12fdda0d30a
Author: Devin Matthews <damatthews@smu.edu>
Date:   Fri Jul 24 13:56:13 2020 -0500

    Add BLIS_EXPORT_BLIS to bli_abort. (#429)
    
    Fixes #428.

commit a69a4d7e2f4607c919db30b14535234ce169c789
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 22 16:13:09 2020 -0500

    Cleaned up bool_t usage and various typecasts.
    
    Details:
    - Fixed various typecasts in
    
        frame/base/bli_cntx.h
        frame/base/bli_mbool.h
        frame/base/bli_rntm.h
        frame/include/bli_misc_macro_defs.h
        frame/include/bli_obj_macro_defs.h
        frame/include/bli_param_macro_defs.h
    
      that were missing or being done improperly/incompletely. For example,
      many return values were being typecast as
        (bool_t)x && y
      rather than
        (bool_t)(x && y)
      Thankfully, none of these deficiencies had manifested as actual bugs
      at the time of this commit.
    - Changed the return type of bli_env_get_var() from dim_t to gint_t.
      This reflects the fact that bli_env_get_var() needs to be able to
      return a signed integer, and even though dim_t is currently defined
      as a signed integer, it does not intuitively appear to necessarily be
      signed by inspection (i.e., an integer named "dim_t" for matrix
      "dimension"). Also, updated use of bli_env_get_var() within
      bli_pack.c to reflect the changed return type.
    - Redefined type of thrcomm_t.barrier_sense field from bool_t to gint_t
      and added comments to the bli_thrcomm_*.h files that will explain a
      planned replacement of bool_t with C99's bool type.
    - Note: These changes are being made to facilitate the substitution of
      'bool' for 'bool_t', which will eliminate the namespace conflict with
      arm_sve.h as reported in issue #420. This commit implements the first
      phase of that transition. Thanks to RuQing Xu for reporting this
      issue.
    - CREDITS file update.
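
    Illustrative sketch (not the BLIS source): the difference between the
    two typecast forms mentioned above is one of precedence. In the first
    form only x is cast and the && result keeps type int; in the second
    the whole result is cast. For && the value is the same either way,
    which is consistent with no bugs having manifested. The typedef and
    function names are hypothetical, and int64_t stands in for gint_t.

    ```c
    #include <stdint.h>
    #include <stddef.h>

    typedef int64_t old_bool_t;  /* stand-in for an integer-based bool_t */

    /* The cast binds to x alone; the expression has type int. */
    static size_t size_cast_first( int x, int y )
    {
        return sizeof( ( old_bool_t )x && y );
    }

    /* The cast applies to the whole result; the type is old_bool_t. */
    static size_t size_cast_whole( int x, int y )
    {
        return sizeof( ( old_bool_t )( x && y ) );
    }
    ```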

commit a6437a5c11d364c6c88af527294d29734d7cc7d6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 20 19:21:07 2020 -0500

    Replaced broken ref99 sandbox w/ simpler version.
    
    Details:
    - The 'ref99' sandbox was broken by multiple refactorings and internal
      API changes over the last two years. Rather than try to fix it, I've
      replaced it with a much simpler version based on var2 of gemmsup.
      Why not fix the previous implementation? It occurred to me that the
      old implementation was trying to be a lightly simplified duplication
      of what exists in the framework. Duplication aside, this sandbox
      would have worked fine if it had been completely independent of the
      framework code. The problem was that it was only partially
      independent, with many function calls calling a function in BLIS
      rather than a duplicated/simplified version within the sandbox. (And
      the reason I didn't make it fully independent to begin with was that
      it seemed unnecessarily duplicative at the time.) Maintaining two
      versions of the same implementation is problematic for obvious
      reasons, especially when it wasn't even done properly to begin with.
      This explains the reimplementation in this commit. The only catch is
      that the newer implementation is single-threaded only and does not
      perform any packing on either input matrix (A or B). Basically, it's
      only meant to be a simple placeholder that shows how you could plug
      in your own implementation. Thanks to Francisco Igual for reporting
      this brokenness.
    - Updated the three reference gemmsup kernels (defined in
      ref_kernels/3/bli_gemmsup_ref.c) so that they properly handle
      conjugation of conja and/or conjb. The general storage kernel, which
      is currently identical to the column-storage kernel, is used in the
      new ref99 sandbox to provide basic support for all datatypes
      (including scomplex and dcomplex).
    - Minor updates to docs/Sandboxes.md, including adding the threading
      and packing limitations to the Caveats section.
    - Fixed a comment typo in bli_l3_sup_var1n2m.c (upon which the new
      sandbox implementation is based).

commit bca040be9da542dd9c75d91890fa7731841d733d
Merge: 2605eb4d 171ecc1d
Author: Devin Matthews <damatthews@smu.edu>
Date:   Mon Jul 20 09:27:30 2020 -0500

    Merge pull request #425 from gmargari/patch-1
    
    Update Multithreading.md

commit 171ecc1dc6f055ea39da30e508f711b49a734359
Author: Giorgos Margaritis <gmargari@protonmail.com>
Date:   Mon Jul 20 12:24:06 2020 +0300

    Update Multithreading.md

commit 2605eb4d99d3813c37a624c011aa2459324a6d89
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 15 15:25:19 2020 -0500

    Added missing rv_d?x6 edge cases to sup kernel.
    
    Details:
    - Added support to bli_gemmsup_rv_haswell_asm_d6x8n.c for handling
      various n = 6 edge cases with a single sup kernel call. Previously,
      only n = {4,2,1} were handled explicitly as single kernel calls;
      that is, cases where n = 6 were previously being executed via two
      kernel calls (n = 4 and n = 2).
    - Added commented debug line to testsuite's test_libblis.c.

commit 72f6ed0637dfcb021de04ac7d214d5c87e55d799
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 3 17:55:54 2020 -0500

    Declare/define static functions via BLIS_INLINE.
    
    Details:
    - Updated all static function definitions to use the cpp macro
      BLIS_INLINE instead of the static keyword. This allows blis.h to
      use a different keyword (inline) to define these functions when
      compiling with C++, which might otherwise trigger "defined but
      not used" warning messages. Thanks to Giorgos Margaritis for
      reporting this issue and Devin Matthews for suggesting the fix.
    - Updated the following files, which are used by configure's
      hardware auto-detection facility, to unconditionally #define
      BLIS_INLINE to the static keyword (since we know BLIS will be
      compiled with C, not C++):
        build/detect/config/config_detect.c
        frame/base/bli_arch.c
        frame/base/bli_cpuid.c
    - CREDITS file update.
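
    Illustrative sketch (not the BLIS source): a macro of the kind
    described above might be defined as follows. EXAMPLE_INLINE and
    example_min() are hypothetical names; the macro introduced by this
    commit is BLIS_INLINE.

    ```c
    /* Choose the keyword based on the language compiling the header. */
    #ifdef __cplusplus
      #define EXAMPLE_INLINE inline
    #else
      #define EXAMPLE_INLINE static
    #endif

    /* A small header-style helper declared via the macro. */
    EXAMPLE_INLINE int example_min( int a, int b )
    {
        return ( a < b ? a : b );
    }
    ```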

commit 5fc701ac5f94c6300febbb2f24e731aa34f0f34a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 1 15:48:58 2020 -0500

    Added -fomit-frame-pointer option to CKOPTFLAGS.
    
    Details:
    - Added the -fomit-frame-pointer compiler option to the CKOPTFLAGS
      variable in the following make_defs.mk files:
        config/haswell/make_defs.mk
        config/skx/make_defs.mk
      as well as comments that mention why the compiler option is needed.
      This option is needed to prevent the compiler from using the rbp
      frame register (in the very early portion of kernel code, typically
      where k_iter and k_left are defined and computed), which, as of
      1c719c9, is used explicitly by the gemmsup millikernels. Thanks to
      Devin Matthews for identifying this missing option and to Jeff
      Diamond for reporting the original bug in #417.
    - The file
        config/zen/amd_config.mk
      which feeds into the make_defs.mk for both zen and zen2 subconfigs,
      was also touched, but only to add a commented-out compiler option
      (and the aforementioned explanatory comment) since that file already
      uses -fomit-frame-pointer in COPTFLAGS, which forms the basis of
      CKOPTFLAGS.

commit 6af59b705782dada47e45df6634b479fe781d4fe
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 1 14:54:23 2020 -0500

    Fixed disabled edge case optimization in gemmsup.
    
    Details:
    - Fixed an inadvertently disabled edge case optimization in the two
      gemmsup variants in bli_l3_sup_var1n2m.c. Background: These edge case
      optimizations allow the last millikernel operation in the jr loop to
      be executed with an inflated register blocksize if it is the last
      (or only) iteration. For example, suppose mr=6 and nr=8 and the gemmsup
      problem is m=8, n=100, k=100. (In this case, the panel-block variant
      (var1n) is executed, which places the jr loop in the m dimension.)
      In principle, this problem could be executed as two millikernels: one
      with dimensions 6x100x100, and one as 2x100x100. However, with the
      support for inflated blocksizes in the kernel, the entire 8x100x100
      problem can be passed to the millikernel function, which will then
      execute it more favorably as two 4x100x100 millikernel sub-calls.
      Now, this optimization is disabled under certain circumstances, such
      as when multithreading. Previously, the is_mt predicate was being set
      incorrectly such that it was non-zero even when running
      single-threaded.
    - Upon fixing the is_mt issue above, another bit of code needed to be
      moved so that the result of the optimization could have an impact on
      the assignment of loop bounds ranges to threads.
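
    Illustrative sketch (not the BLIS source): the effect of the edge case
    optimization on the number of millikernel calls in the example above
    (m=8, mr=6) can be modeled as follows. The function and the 'mr_max'
    limit on an inflated call are assumptions made for this example; the
    actual limit and loop structure are defined in bli_l3_sup_var1n2m.c.

    ```c
    /* Count millikernel calls along one dimension.
       m:       problem size in that dimension
       mr:      register blocksize
       mr_max:  assumed largest extent a single inflated call may accept
       inflate: nonzero if last-iteration inflation is permitted
                (for example, when not multithreading) */
    static int count_millikernel_calls( int m, int mr, int mr_max, int inflate )
    {
        int calls = 0;
        int left  = m;

        while ( left > 0 )
        {
            if ( inflate && left <= mr_max )
            {
                /* The final call absorbs everything that remains. */
                left = 0;
            }
            else
            {
                /* Ordinary iteration: at most mr rows per call. */
                left -= ( left < mr ? left : mr );
            }
            ++calls;
        }

        return calls;
    }
    ```

    With m=8, mr=6, and an assumed mr_max of 9, this gives two calls (6
    and 2) when inflation is disabled and one call (8) when it is enabled.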

commit b37634540fab0f9b8d4751b8356ee2e17c9e3b00
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 25 16:05:12 2020 -0500

    Support ldims, packing in sup/test drivers.
    
    Details:
    - Updated the test/sup source file (test_gemm.c) and Makefile to support
      building matrices with small or large leading dimensions, and updated
      runme.sh to support executing both kinds of test drivers.
    - Updated runme.sh to allow for executing sup drivers with unpacked (the
      default) or packed matrices (via setting BLIS_PACK_A, BLIS_PACK_B
      environment variables), and for capturing output to files that encode
      both the leading dimension (small or large) and packing status into
      the filenames.
    - Consolidated octave scripts in test/sup/octave_st, test/sup/octave_mt
      into test/sup/octave and updated the octave code in that consolidated
      directory to read the new output filename format (encoding ldim and
      packing). Also added comments and streamlined code, particularly in
      plot_panel_trxsh.m. Tested the octave scripts with octave 5.2.0.
    - Moved old octave_st, octave_mt directories to test/sup/old.

commit ceb9b95a96cc3844ecb43d9af48ab289584e76b6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 18 17:15:25 2020 -0500

    Fixed incorrect link to shiftd in BLISTypedAPI.md.
    
    Details:
    - Previously, the entry for shiftd in the Operation index section of
      BLISTypedAPI.md was incorrectly linking to the shiftd operation entry
      in BLISObjectAPI.md. This has been fixed. Thanks to Jeff Diamond for
      helping find this incorrect link.

commit b3c42016818797f79e55b32c8b7d090f9d0aa0ea
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 18 14:00:56 2020 -0500

    CREDITS file update.

commit 31af73c11abae03248d959da0f81eacea015b57a
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Thu Jun 18 13:35:54 2020 -0500

    Expand windows instructions (#414)
    
    * Expand windows instructions
    
    * Windows: both static and shared don't work at the same time

commit b5b604e106076028279e6d94dc0e51b8ad48e802
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jun 17 16:42:24 2020 -0500

    Ensure random objects' 1-norms are non-zero.
    
    Details:
    - Fixed an innocuous bug that manifested when running the testsuite on
      extremely small matrices with randomization via the "powers of 2 in
      narrow precision range" option enabled. When the randomization
      function emits a perfect 0.0 to fill a 1x1 matrix, the testsuite will
      then compute 0.0/0.0 during the normalization process, which leads to
      NaN residuals. The solution entails smarter implementations of randv,
      randnv, randm, and randnm, each of which will compute the 1-norm of
      the vector or matrix in question. If the object has a 1-norm of 0.0,
      the object is re-randomized until the 1-norm is not 0.0. Thanks to
      Kiran Varaganti for reporting this issue (#413).
    - Updated the implementation of randm_unb_var1() so that it loops over
      a call to the randv_unb_var1() implementation directly rather than
      calling it indirectly via randv(). This was done to avoid the overhead
      of multiple calls to norm1v() when randomizing the rows/columns of a
      matrix.
    - Updated comments.
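    
    A minimal sketch of the re-randomization idea from the first bullet,
    in plain C. The names rand_pow2_or_zero(), norm1(), and rand_nonzero()
    are hypothetical and are not the BLIS randv/norm1v implementations;
    the generator is only a stand-in that can emit an exact 0.0.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a "powers of 2 in a narrow range" generator.
   It returns 0.0 or +/-2^(-e) for e in 1..7. Not the BLIS code. */
double rand_pow2_or_zero( void )
{
    int e = rand() % 8;
    if ( e == 0 ) return 0.0;               /* the troublesome exact zero */
    double v = 1.0 / ( double )( 1 << e );  /* 2^(-e) */
    return ( rand() % 2 ) ? v : -v;
}

/* 1-norm of a vector: the sum of the absolute values of its elements. */
double norm1( int n, const double* x )
{
    double s = 0.0;
    for ( int i = 0; i < n; ++i ) s += ( x[i] < 0.0 ? -x[i] : x[i] );
    return s;
}

/* Fill x with random values, repeating until the 1-norm is non-zero. */
void rand_nonzero( int n, double* x )
{
    assert( n > 0 );
    do
    {
        for ( int i = 0; i < n; ++i ) x[i] = rand_pow2_or_zero();
    }
    while ( norm1( n, x ) == 0.0 );
}
```

    Calling rand_nonzero( 1, x ) on a one-element vector never leaves
    x[0] equal to 0.0, so a later division by the norm cannot be 0.0/0.0.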

commit 35e38fb693e7cbf2f3d7e0505a63b2c05d3f158d
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Tue Jun 16 10:59:41 2020 -0500

    Fix typo in FAQ

commit 1c719c91a3ef0be29a918097652beef35647d4b2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 4 17:21:08 2020 -0500

    Bugfixes, cleanup of sup dgemm ukernels.
    
    Details:
    - Fixed a few not-really-bugs:
      - Previously, the d6x8m kernels were still prefetching the next upanel
        of A using MR*rs_a instead of ps_a (same for prefetching of next
        upanel of B in d6x8n kernels using NR*cs_b instead of ps_b). Given
        that the upanels might be packed, using ps_a or ps_b is the correct
        way to compute the prefetch address.
      - Fixed an obscure bug in the rd_d6x8m kernel that, by dumb luck,
        executed as intended even though it was based on faulty pointer
        management. Basically, in the rd_d6x8m kernel, the pointer for B
        (stored in rdx) was loaded only once, outside of the jj loop, and in
        the second iteration its new position was calculated by incrementing
        rdx by the *absolute* offset (four columns), which happened to be the
        same as the relative offset (also four columns) that was needed. It
        worked only because that loop only executed twice. A similar issue
        was fixed in the rd_d6x8n kernels.
    - Various cleanups and additions, including:
      - Factored out the loading of rs_c into rdi in rd_d6x8[mn] kernels so
        that it is loaded only once outside of the loops rather than
        multiple times inside the loops.
      - Changed outer loop in rd kernels so that the jump/comparison and
        loop bounds more closely mimic what you'd see in higher-level source
        code. That is, something like:
          for( i = 0; i < 6; i+=3 )
        rather than something like:
          for( i = 0; i <= 3; i+=3 )
      - Switched row-based IO to use byte offsets instead of byte column
        strides (e.g. via rsi register), which were known to be 8 anyway
        since otherwise that conditional branch wouldn't have executed.
      - Cleaned up and homogenized prefetching a bit.
      - Updated the comments that show the before and after of the
        in-register transpositions.
      - Added comments to column-based IO cases to indicate which columns
        are being accessed/updated.
      - Added rbp register to clobber lists.
      - Removed some dead (commented out) code.
      - Fixed some copy-paste typos in comments in the rv_6x8n kernels.
      - Cleaned up whitespace (including leading ws -> tabs).
      - Moved edge case (non-milli) kernels to their own directory, d6x8,
        and split them into separate files based on the "NR" value of the
        kernels (Mx8, Mx4, Mx2, etc.).
      - Moved config-specific reference Mx1 kernels into their own file
        (e.g. bli_gemmsup_r_haswell_ref_dMx1.c) inside the d6x8 directory.
      - Added rd_dMx1 assembly kernels, which seem marginally faster than
        the corresponding reference kernels.
      - Updated comments in ref_kernels/bli_cntx_ref.c and changed to using
        the row-oriented reference kernels for all storage combos.

commit 943a21def0bedc1732c0a2453afe7c90d7f62e95
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Thu May 21 14:09:21 2020 -0500

    Add build instructions for Windows (#404)

commit fbef422f0d968df10e598668b427af230cfe07e8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu May 21 10:30:41 2020 -0500

    Separate OS X and Windows into separate FAQs.
    
    Details:
    - Separated the unified Mac OS X / Windows frequently asked question
      into two separate questions, one for each OS.

commit 28be1a4265ea67e3f177c391aba3dbbcf840bd52
Author: Guodong Xu <guodong.xu@linaro.org>
Date:   Thu May 21 02:22:22 2020 +0800

    avoid loading twice in armv8a gemm kernel (#403)
    
    This bug happens at a corner case, when k_iter == 0 and we jump to
    CONSIDERKLEFT.
    
    In the current design, the first row/col. of a and b are loaded twice.
    
    The fix is to rearrange a and b (first row/col.) loading instructions.
    
    Signed-off-by: Guodong Xu <guodong.xu@linaro.org>

commit d51245e58b0beff2717156b980007c90337150d8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri May 8 18:00:54 2020 -0500

    Add support for Intel oneAPI in configure.
    
    Details:
    - Properly select cc_vendor based on the output of invoking CC with the
      --version option, including cases where CC is the variant of clang
      that is included with Intel oneAPI. (However, we continue to treat
      the compiler as clang for other purposes, not icc.) Thanks to Ajay
      Panyala and Devin Matthews for reporting on this issue via #402.

commit 787adad73bd5eb65c12c39d732723a1ac0448748
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri May 8 16:18:20 2020 -0500

    Defined netlib equivalent of xerbla_array().
    
    Details:
    - Added a function definition for xerbla_array_(), which largely mirrors
      its netlib implementation. Thanks to Isuru Fernando for suggesting the
      addition of this function.

commit c53b5153bee585685bf95ce22e058a7af72ecef0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 5 12:39:12 2020 -0500

    Documented Perl prerequisite for build system.
    
    Details:
    - Added Perl to list of prerequisites for building BLIS. This is in part
      (and perhaps completely?) due to some substitution commands used at
      the end of configure that include '\n' characters that are not
      properly interpreted by the version of sed included on some versions
      of OS X. This new documentation addresses issue #398.

commit f032d5d4a6ed34c8c3e5ba1ed0b14d1956d0097c
Author: Guodong Xu <guodong.xu@linaro.org>
Date:   Thu Apr 30 01:08:46 2020 +0800

    New kernel set for Arm SVE using assembly (#396)
    
    This adds two kernels for the Arm SVE vector extension.
    1. a gemm  kernel for double at sizes 8x8.
    2. a packm kernel for double at dimension 8xk.
    
    To achieve best performance, variable length agnostic programming
    is not used. Vector length (VL) of 256 bits is mandated in both kernels.
    Kernels to support other VLs can be added later.
    
    "SVE is a vector extension for AArch64 execution mode for the A64
    instruction set of the Armv8 architecture. Unlike other SIMD architectures,
    SVE does not define the size of the vector registers, but constrains into
    a range of possible values, from a minimum of 128 bits up to a maximum of
    2048 in 128-bit wide units. Therefore, any CPU vendor can implement the
    extension by choosing the vector register size that better suits the
    workloads the CPU is targeting. Instructions are provided specifically
    to query an implementation for its register size, to guarantee that
    the applications can run on different implementations of the ISA without
    the need to recompile the code."  [1]
    
    [1] https://developer.arm.com/solutions/hpc/resources/hpc-white-papers/arm-scalable-vector-extensions-and-application-to-machine-learning
    
    Signed-off-by: Guodong Xu <guodong.xu@linaro.org>

commit 4d87eb24e8e1f5a21e04586f6df4f427bae0091b
Author: Yingbo Ma <mayingbo5@gmail.com>
Date:   Mon Apr 27 17:02:47 2020 -0400

    Update KernelsHowTo.md (#395)

commit 477ce91c5281df2bbfaddc4d86312fb8c8f879e2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Apr 22 14:26:49 2020 -0500

    Moved #include "cpuid.h" to bli_cpuid.c.
    
    Details:
    - Relocated the #include "cpuid.h" directive from bli_cpuid.h to
      bli_cpuid.c. This was done because cpuid.h (which is pulled into
      the post-build blis.h developer header) doesn't protect its
      definitions with a preprocessor guard of the form:
    
        #ifndef FOOBAR_H
        #define FOOBAR_H
        // header contents.
        #endif
    
      and as a result, applications (previously) could not #include both
      blis.h and cpuid.h (since the former was already including the
      latter). Thanks to Bhaskar Nallani for raising this issue via #393
      and to Devin Matthews for suggesting this fix.
    - CREDITS file update.

commit 8bde63ffd7474a97c3a3b0b0dc1eae45be0ab889
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Apr 18 12:50:12 2020 -0500

    Adding missing conjy to her2/syr2 in typed API doc.
    
    Details:
    - Fixed a missing argument (conjy) in the function signatures of
      bli_?her2() and bli_?syr2() in docs/BLISTypedAPI.md. Thanks to Robert
      van de Geijn for reporting this omission.

commit 976902406b610afdbacb2d80a7a2b4b43ff30321
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 17 15:11:10 2020 -0500

    Disable packing by default in expert rntm_t init.
    
    Details:
    - Changed the behavior of bli_rntm_init() as well as the static
      initializer, BLIS_RNTM_INITIALIZER, so that user-initialized rntm_t
      objects by default specify the disabling of packing for A and B.
      Packing of A/B was already disabled by default when calling non-expert
      APIs (and enabled only when the user set environment variables
      BLIS_PACK_A or BLIS_PACK_B). With this commit, the default behavior of
      using user-initialized rntm_t objects with expert APIs comes into line
      with the default behavior of non-expert APIs--that is, they now both
      lead to the avoidance of packing in the sup code path. (Note: The
      conventional code path is unaffected by the environment variables
      BLIS_PACK_A/BLIS_PACK_B and/or the disabling of packing in a rntm_t
      object when calling an expert API.) This addresses issue #392. Thanks
      to Kiran Varaganti for bringing this inconsistency to our attention.
    - The above change was accomplished by changing the definitions of
      static functions bli_rntm_clear_pack_a() and bli_rntm_clear_pack_b()
      in bli_rntm.h, which are both for internal use only.

commit 5f2aee7c5fa5d562acaf8fbde3df0e2a04e1dd1b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 7 14:55:15 2020 -0500

    README.md update to promote supmt dgemm.
    
    Details:
    - Updated the sup entry in the "What's New" section of the README.md
      file to promote the multithreaded dgemm sup feature introduced in
      c0558fd.

commit f5923cd9ff5fbd91190277dea8e52027174a1d57
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 7 14:41:45 2020 -0500

    CHANGELOG update (0.7.0)

commit 68b88aca6692c75a9f686187e6c4a4e196ae60a9 (tag: 0.7.0)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 7 14:41:44 2020 -0500

    Version file update (0.7.0)

commit b04de636c1702e4cb8e7ad82bab3cf43d2dbdfc6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 7 14:37:43 2020 -0500

    ReleaseNotes.md update in advance of next version.
    
    Details:
    - Updated docs/ReleaseNotes.md in preparation for next version.

commit 2cb604ba472049ad498df72d4a2dc47a161d4c3c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 6 16:42:14 2020 -0500

    Rename more bli_thread_obarrier(), _obroadcast().
    
    Details:
    - Renamed instances of bli_thread_obarrier() and bli_thread_obroadcast()
      that were made in the supmt-specific code committed to the 'amd'
      branch, which has now been merged with 'master'. Prior to the merge,
      'master' received commit c01d249, which applied these renamings to
      the existing, non-sup codebase.

commit efb12bc895de451067649d5dceb059b7827a025f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 6 15:01:53 2020 -0500

    Minor updates/elaborations to RELEASING file.

commit 2e3b3782cfb7a2fd0d1a325844983639756def7d
Merge: 9f3a8d4d da0c086f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 6 14:55:35 2020 -0500

    Merge branch 'master' into amd

commit da0c086f4643772e111318f95a712831b0f981a8
Author: Satish Balay <balay@mcs.anl.gov>
Date:   Tue Mar 31 17:09:41 2020 -0500

    OSX: specify the full path to the location of libblis.dylib (#390)
    
    * OSX: specify the full path to the location of libblis.dylib so that it can be found at runtime
    
    Before this change:
    
    Application gives runtime error [when linked with blis]
    dyld: Library not loaded: libblis.3.dylib
    
    balay@kpro lib % otool -L libblis.dylib
    libblis.dylib:
            libblis.3.dylib (compatibility version 0.0.0, current version 0.0.0)
            /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1281.0.0)
    
    After this change:
    balay@kpro lib % otool -L libblis.dylib
    libblis.dylib:
            /Users/balay/petsc/arch-darwin-c-debug/lib/libblis.3.dylib (compatibility version 0.0.0, current version 0.0.0)
            /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1281.0.0)
    
    * INSTALL_LIBDIR -> libdir as INSTALL_LIBDIR has DESTDIR
    
    Co-Authored-By: Jed Brown <jed@jedbrown.org>
    
    * CREDITS file update.
    
    Co-authored-by: Jed Brown <jed@jedbrown.org>
    Co-authored-by: Field G. Van Zee <field@cs.utexas.edu>

commit 2bca03ea9d87c0da829031a5332545d05e352211
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Mar 28 22:10:00 2020 +0000

    Updates, tweaks to runme.sh in test/1m4m.
    
    Details:
    - Made several updates to test/1m4m/runme.sh, including:
      - Added missing handling for 1m and 4m1a implementations when setting
        the BLIS_??_NT environment variables.
      - Added support for using numactl to run the test executables.
      - Several other cleanups.

commit c40a33190b94af5d5c201be63366594859b1233f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Mar 26 16:55:00 2020 -0500

    Warn user when auto-detection returns 'generic'.
    
    Details:
    - Added logic to configure that causes the script to output a warning
      to the user if/when "./configure auto" is run and the underlying
      hardware feature detection code is unable to identify the hardware.
      In these cases, the auto-detect code will return 'generic', which
      is likely not what the user expected, and a flag will be set so that
      a message is printed at the end of the configure output. (Thankfully,
      we don't expect this scenario to play out very often.) Thanks to
      Devin Matthews for suggesting this fix (#384).

commit 492a736fab5b9c882996ca024b64646877f22a89
Author: Devin Matthews <damatthews@smu.edu>
Date:   Tue Mar 24 17:28:47 2020 -0500

    Fix vectorized version of bli_amaxv (#382)
    
    * Fix vectorized version of bli_amaxv
    
    To match Netlib, i?amax should return:
    - the lowest index among equal values
    - the first NaN if one is encountered
    
    * Fix typos.
    
    * And another one...
    
    * Update ref. amaxv kernel too.
    
    * Re-enabled optimized amaxv kernels.
    
    Details:
    - Re-enabled the optimized, intrinsics-based amaxv kernels in the 'zen'
      kernel set for use in haswell, zen, zen2, knl, and skx subconfigs.
      These two kernels (for s and d datatypes) were temporarily disabled in
      e186d71 as part of issue #380. However, the key missing semantic
      properties that prompted the disabling of these kernels--returning the
      index of the *first* rather than of the last element with largest
      absolute value, and returning the index of the first NaN if one is
      encountered--were added as part of #382 thanks to Devin Matthews.
      Thus, now that the kernels are working as expected once more, this
      commit causes these kernels to once again be registered for the
      affected subconfigs, which effectively reverts all code changes
      included in e186d71.
    - Whitespace/formatting updates to new macros in bli_amaxv_zen_int.c.
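    
    A minimal scalar sketch of the two semantic properties described
    above (lowest index among equal values, and the first NaN if one is
    encountered). The function name amax_first() is hypothetical, the
    zero-based index and unit stride are assumptions of this sketch, and
    it is not the vectorized 'zen' kernel or the BLIS reference kernel.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Return the zero-based index of the element of x with the largest
   absolute value. Ties resolve to the lowest index, and the index of the
   first NaN is returned as soon as one is encountered. */
size_t amax_first( size_t n, const double* x )
{
    assert( n > 0 );

    size_t imax = 0;
    double vmax = -1.0;  /* smaller than any absolute value */

    for ( size_t i = 0; i < n; ++i )
    {
        if ( isnan( x[i] ) ) return i;  /* the first NaN wins */

        double a = ( x[i] < 0.0 ? -x[i] : x[i] );

        /* Strict '>' keeps the earlier index when values are equal. */
        if ( a > vmax ) { vmax = a; imax = i; }
    }

    return imax;
}
```

    For { 1.0, -3.0, 3.0, 2.0 } this returns 1 rather than 2, since the
    comparison is strict and -3.0 is seen first.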
    
    Co-authored-by: Field G. Van Zee <field@cs.utexas.edu>

commit e186d7141a51f2d7196c580e24e7b7db8f209db9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Mar 21 18:40:36 2020 -0500

    Disabled optimized amaxv kernels.
    
    Details:
    - Disabled use of optimized amaxv kernels, which use vector intrinsics
      for both 's' and 'd' datatypes. We disable these kernels because the
      current implementations fail to observe a semantic property of the
      BLAS i?amax_() subroutine, which is to return the index of the
      *first* element containing the maximum absolute value (that is, the
      first element if there exist two or more elements that contain the
      same value). With the optimized kernels disabled, the affected
      subconfigurations (haswell, zen, zen2, knl, and skx) will use the
      default reference implementations. Thanks to Mat Cross for reporting
      this issue via #380.
    - CREDITS file update.

commit 9f3a8d4d851725436b617297231a417aa9ce8c6a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Mar 14 17:48:43 2020 -0500

    Added missing return to bli_thread_partition_2x2().
    
    Details:
    - Added a missing return statement to the body of an early case handling
      branch in bli_thread_partition_2x2(). This bug only affected cases
      where n_threads < 4, and even then, the code meant to handle cases
      where n_threads >= 4 executes and does the right thing, albeit using
      more CPU cycles than needed. Nonetheless, thanks to Kiran Varaganti
      for reporting this bug via issue #377.
    - Whitespace changes to bli_thread.c (spaces -> tabs).

commit 8c3d9b9eeb6f816ec8c32a944f632a5ad3637593
Merge: 71249fe8 0f9e0399
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 10 14:03:33 2020 -0500

    Merge branch 'amd' of github.com:flame/blis into amd

commit 71249fe8ddaa772616698f1e3814d40e012909ea
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 10 13:55:29 2020 -0500

    Merged test/sup, test/supmt into test/sup.
    
    Details:
    - Updated the Makefile, test_gemm.c, and runme.sh in test/sup to be able
      to compile and run both single-threaded and multithreaded experiments.
      This should help with maintenance going forward.
    - Created a test/sup/octave_st directory of scripts (based on the
      previous test/sup/octave scripts) as well as a test/sup/octave_mt
      directory (based on the previous test/supmt/octave scripts). The
      octave scripts are slightly different and not easily mergeable, and
      thus for now I'll maintain them separately.
    - Preserved the previous test/sup directory as test/sup/old/supst and
      the previous test/supmt directory as test/sup/old/supmt.

commit 0f9e0399e16e96da2620faf2c0c3c21274bb2ebd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Mar 5 17:03:21 2020 -0600

    Updated sup performance graphs; added mt results.
    
    Details:
    - Reran all existing single-threaded performance experiments comparing
      BLIS sup to other implementations (including the conventional code
      path within BLIS), using the latest versions (where appropriate).
    - Added multithreaded results for the three existing hardware types
      showcased in docs/PerformanceSmall.md: Kaby Lake, Haswell, and Epyc
      (Zen1).
    - Various minor updates to the text in docs/PerformanceSmall.md.
    - Updates to the octave scripts in test/sup/octave, test/supmt/octave.

commit 90db88e5729732628c1f3acc96eeefab49f2da41
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Mar 2 15:06:48 2020 -0600

    Updated sup[mt] Makefiles for variable dim ranges.
    
    Details:
    - Updated test/sup/Makefile and test/supmt/Makefile to allow specifying
      different problem size ranges for the drivers where one, two, or three
      matrix dimensions is large. This will facilitate the generation of
      more meaningful graphs, particularly when two dimensions are tiny.

commit 31f11a06ea9501724feec0d2fc5e4644d7dd34fc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 27 14:33:20 2020 -0600

    Updates to octave scripts in test/sup[mt]/octave.
    
    Details:
    - Optimized scripts in test/sup/octave and test/supmt/octave for use
      with octave 5.2.0 on Ubuntu 18.04.
    - Fixed stray 'end' keywords in gen_opsupnames.m and plot_l3sup_perf.m,
      which were not only unnecessary but also causing issues with versions
      5.x.

commit c01d249d7c546fe2e3cee3fe071cd4c4c88b9115
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 25 14:50:53 2020 -0600

    Renamed bli_thread_obarrier(), _obroadcast().
    
    Details:
    - Renamed two bli_thread_*() APIs:
        bli_thread_obarrier()   -> bli_thread_barrier()
        bli_thread_obroadcast() -> bli_thread_broadcast()
      The 'o' was a leftover from when thrcomm_t objects tracked both
      "inner" and "outer" communicators. They have long since been
      simplified to only support the latter, and thus the 'o' is
      superfluous.

commit f6e6bf73e695226c8b23fe7900da0e0ef37030c1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 24 17:52:23 2020 -0600

    List Gentoo under supported external packages.
    
    Details:
    - Add mention of Gentoo Linux under the list of external packages in
      the README.md file. Thanks to M. Zhou for maintaining this package.

commit 9e5f7296ccf9b3f7b7041fe1df20b927cd0e914b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 18 15:16:03 2020 -0600

    Skip building thrinfo_t tree when mt is disabled.
    
    Details:
    - Return early from bli_thrinfo_sup_grow() if the thrinfo_t object
      address is equal to either &BLIS_GEMM_SINGLE_THREADED or
      &BLIS_PACKM_SINGLE_THREADED.
    - Added preprocessor logic to bli_l3_sup_thread_decorator() in
      bli_l3_sup_decor_single.c that (by default) disables code that
      creates and frees the thrinfo_t tree and instead passes
      &BLIS_GEMM_SINGLE_THREADED as the thrinfo_t pointer into the
      sup implementation.
    - The net effect of the above changes is that a small amount of
      thrinfo_t overhead is avoided when running small/skinny dgemm
      problems when BLIS is compiled with multithreading disabled.

commit 90081e6a64b5ccea9211bdef193c2d332c68492f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 17 14:57:25 2020 -0600

    Fixed bug(s) in mt sup when single-threaded.
    
    Details:
    - Fixed a syntax bug in bli_l3_sup_decor_single.c as a result of
      changing function interface for the thread entry point function
      (of type l3supint_t).
    - Unfortunately, fixing the interface was not enough, as it caused
      a memory leak in the sba at bli_finalize() time. It turns out that,
      due to the new multithreading-capable variant code using thrinfo_t
      objects--specifically, their calling of bli_thrinfo_grow()--we
      have to pass in a real thrinfo_t object rather than the global
      objects &BLIS_PACKM_SINGLE_THREADED or &BLIS_GEMM_SINGLE_THREADED.
      Thus, I inserted the appropriate logic from the OpenMP and pthreads
      versions so that single-threaded execution would work as intended
      with the newly upgraded variants.

commit c0558fde4511557c8f08867b035ee57dd2669dc6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 17 14:08:08 2020 -0600

    Support multithreading within the sup framework.
    
    Details:
    - Added multithreading support to the sup framework (via either OpenMP
      or pthreads). Both variants 1n and 2m now have the appropriate
      threading infrastructure, including data partitioning logic, to
      parallelize computation. This support handles all four combinations
      of packing on matrices A and B (neither, A only, B only, or both).
      This implementation tries to be a little smarter when automatic
      threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
      recalculate the factorization in units of micropanels (rather than
      using the raw dimensions) in bli_l3_sup_int.c, when the final
      problem shape is known and after threads have already been spawned.
    - Implemented bli_?packm_sup_var2(), which packs to conventional row-
      or column-stored matrices. (This is used for the rrc and crc storage
      cases.) Previously, copym was used, but that would no longer suffice
      because it could not be parallelized.
    - Minor reorganization of packing-related sup functions. Specifically,
      bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
      instead of from the variant functions. This has the effect of making
      the variant functions more readable.
    - Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
      and inserted usage of these functions within bli_thrinfo_init(), which
      previously was accessing thrinfo_t fields via the -> operator.
    - Renamed bli_partition_2x2() to bli_thread_partition_2x2().
    - Added an auto_factor field to the rntm_t struct in order to track
      whether automatic thread factorization was originally requested.
    - Added new test drivers in test/supmt that perform multithreaded sup
      tests, as well as appropriate octave/matlab scripts to plot the
      resulting output files.
    - Added additional language to docs/Multithreading.md to make it clear
      that specifying any BLIS_*_NT variable, even if it is set to 1, will
      be considered manual specification for the purposes of determining
      whether to auto-factorize via BLIS_NUM_THREADS.
    - Minor comment updates.

commit d7a7679182d72a7eaecef4cd9b9a103ee0a7b42b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 7 17:37:03 2020 -0600

    Fixed int-to-packbuf_t conversion error (C++ only).
    
    Details:
    - Fixed an error that manifests only when using C++ (specifically,
      modern versions of g++) to compile drivers in 'test' (and likely most
      other application code that #includes blis.h). Thanks to Ajay Panyala
      for reporting this issue (#374).

commit d626112b8d5302f9585fb37a8e37849747a2a317
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jan 15 13:27:02 2020 -0600

    Removed sorting on LDFLAGS in common.mk (#373).
    
    Details:
    - Removed a line of code in common.mk that passed LDFLAGS through the
      sort function. The purpose was not to sort the contents, but rather
      to remove duplicates. However, there is valid syntax in a string of
      linker flags that, when sorted, yields different/broken behavior.
      So I've removed the line in common.mk that sorts LDFLAGS. Also, for
      future use, I've added a new function, rm-dupls, that removes
      duplicates without sorting. (This function was based on code from a
      stackoverflow thread that is linked to in the comments for that
      code.) Thanks to Isuru Fernando for reporting this issue (#373).

commit e67deb22aaeab5ed6794364520190936748ef272
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jan 14 16:01:34 2020 -0600

    CHANGELOG update (0.6.1)

commit 10949f528c5ffc5c3a2cad47fe16a802afb021be (tag: 0.6.1)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jan 14 16:01:33 2020 -0600

    Version file update (0.6.1)

commit 5db8e710a2baff121cba9c63b61ca254a2ec097a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jan 14 15:59:59 2020 -0600

    ReleaseNotes.md update in advance of next version.
    
    Details:
    - Updated ReleaseNotes.md in preparation for next version.

commit cde4d9d7a26eb51dcc5a59943361dfb8fda45dea
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jan 14 15:19:25 2020 -0600

    Removed 'attic/windows' (to prevent confusion).
    
    Details:
    - Finally removed 'attic/windows' and its contents. This directory once
      contained "proto" Windows support for BLIS, but we've since moved on
      to (thanks to Isuru Fernando) providing Windows DLL support via
      AppVeyor's build artifacts. Furthermore, since 'windows' was the only
      subdirectory within 'attic', the directory path would show up in
      GitHub's listing at https://github.com/flame/blis, which probably led
      to someone being confused about how BLIS provides Windows support. I
      assume (but don't know for sure) that nobody is using these files, so
      this is admittedly a case of shoot first and ask questions later.

commit 7d3407d4681c6449f4bbb8ec681983700ab968f3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jan 14 15:17:53 2020 -0600

    CREDITS file update.

commit f391b3e2e7d11a37300d4c8d3f6a584022a599f5
Author: Dave Love <dave.love@manchester.ac.uk>
Date:   Mon Jan 6 20:15:48 2020 +0000

    Fix parsing in vpu_count on workstation SKX (#351)
    
    * Fix parsing in vpu_count on workstation SKX
    
    * Document Skylake-X as Haswell for single FMA
    
    * Update vpu_count for Skylake and Cascade Lake models
    
    * Support printing the configuration selected, controlled by the environment
    
    Intended particularly for diagnosing mis-selection of SKX through
    unknown, or incorrect, number of VPUs.
    
    * Move bli_log outside the cpp condition, and use it where intended
    
    * Add Fixme comment (Skylake D)
    
    * Mostly superficial edits to commits towards #351.
    
    Details:
    - Moved architecture/sub-config logging-related code from bli_cpuid.c
      to bli_arch.c, tweaked names, and added more set/get layering.
    - Tweaked log messages output from bli_cpuid_is_skx() in bli_cpuid.c.
    - Content, whitespace changes to new bullet in HardwareSupport.md that
      relates to single-VPU Skylake-Xs.
    
    * Fix comment typos
    
    Co-authored-by: Field G. Van Zee <field@cs.utexas.edu>

commit 5ca1a3cfc1c1cc4dd9da6a67aa072ed90f07e867
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jan 6 12:29:12 2020 -0600

    Fixed 'configure' breakage introduced in 6433831.
    
    Details:
    - Added a missing 'fi' (endif) keyword to a conditional block added in
      the configure script in commit 6433831.

commit e7431b4a834ef4f165c143f288585ce8e2272a23
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jan 6 12:01:41 2020 -0600

    Updated 1m draft article link in README.md.

commit 6433831cc3988ad205637ebdebcd6d8f7cfcf148
Author: Jeff Hammond <jeff.r.hammond@intel.com>
Date:   Fri Jan 3 17:52:49 2020 -0800

    blacklist ICC 18 for knl/skx due to test failures
    
    Signed-off-by: Jeff Hammond <jeff.r.hammond@intel.com>

commit af3589f1f98781e3a94a8f9cea8d5ea6f155f7d2
Author: Jeff Hammond <jeff.science@gmail.com>
Date:   Fri Jan 3 13:23:24 2020 -0800

    blacklist Intel 19+
    
    Signed-off-by: Jeff Hammond <jeff.r.hammond@intel.com>

commit 60de939debafb233e57fd4e804ef21b6de198caf
Author: Jeff Hammond <jeff.science@gmail.com>
Date:   Wed Jan 1 21:30:38 2020 -0800

    fix link to docs
    
    the comment contains an incorrect link, which is trivially fixed here.
    
    @fgvanzee I hope you don't mind that I committed directly to master but this cannot break anything.

commit 52711073789b6b84eb99bb0d6883f457ed3fcf80
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 16 16:30:26 2019 -0600

    Fixed bugs in cblas_sdsdot(), sdsdot_().
    
    Details:
    - Fixed a bug in sdsdot_sub() that redundantly added the "alpha" scalar,
      named 'sb'. This value was already being added by the underlying
      sdsdot_() function. Thus, we no longer add 'sb' within sdsdot_sub().
      Thanks to Simon Lukas Märtens for reporting this bug via #367.
    - Fixed a second bug in the order of typecasting intermediate products in
      sdsdot_(). Previously, the "alpha" scalar was being added after the
      "outer" typecast to float. However, the operation is supposed to first
      add the dot product to the (promoted) scalar and THEN downcast the sum
      to float. Thanks to Devin Matthews for catching this bug.
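    
    A minimal sketch of the intended order of operations from the second
    bullet: accumulate in double, add the promoted scalar, and only then
    downcast to float. The function name sdsdot_sketch() is hypothetical,
    unit strides are assumed, and this is not the BLIS sdsdot_() source.

```c
#include <assert.h>

/* Compute sb + dot(x, y) with all intermediate arithmetic in double and
   a single downcast to float at the very end. */
float sdsdot_sketch( int n, float sb, const float* x, const float* y )
{
    assert( n >= 0 );

    double acc = ( double )sb;  /* promote the scalar before adding */

    for ( int i = 0; i < n; ++i )
        acc += ( double )x[i] * ( double )y[i];

    return ( float )acc;  /* downcast the sum, not the dot product alone */
}
```

    The scalar is added exactly once here, which also mirrors the first
    bullet: a wrapper that calls this function should not add sb again.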

commit fe2560a4b1d8ef8d0a446df6002b1e7decc826e9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Dec 6 17:12:44 2019 -0600

    Annotated missing thread-related symbols for export.
    
    Details:
    - Added BLIS_EXPORT_BLIS annotation to function prototypes for
    
        bli_thrcomm_bcast()
        bli_thrcomm_barrier()
        bli_thread_range_sub()
    
      so that these functions are exported to shared libraries by default.
      This (hopefully) fixes issue #366. Thanks to Kyungmin Lee for
      reporting this bug.
    - CREDITS file update.

commit 2853825234001af8f175ad47cef5d6ff9b7a5982
Merge: efa61a6c 61b1f0b0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Dec 6 16:06:46 2019 -0600

    Merge branch 'master' into amd

commit 61b1f0b0602faa978d9912fe58c6c952a33af0ac
Author: Nicholai Tukanov <nicholai@utexas.edu>
Date:   Wed Dec 4 14:18:47 2019 -0600

    Add prototypes for POWER9 reference kernels (#365)
    
    Updates and fixes to power9 subconfig.
    
    Details:
    - Register s,c,z reference gemm and trsm ukernels that assume elements
      of B have been broadcast.
    - Added prototypes for level-3 ukernels that assume elements of B have
      been broadcast. Also added prototype for an spackm function that
      employs a duplication/broadcast factor of 4.
    - Register virtual gemmtrsm ukernels that work with broadcasting of B.
    - Disable right-side hemm, symm, trmm, and trmm3 in bli_family_power9.h.
    - Thanks to Nicholai Tukanov for providing these updates.

commit efa61a6c8b1cfa48781fc2e4799ff32e1b7f8f77
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Nov 29 16:17:04 2019 -0600

    Added missing bli_l3_sup_thread_decorator() symbol.
    
    Details:
    - Defined dummy versions of bli_l3_sup_thread_decorator() for OpenMP
      and pthreads so that those builds don't fail when performing shared
      library linking (especially for Windows DLLs via AppVeyor). For now,
      these dummy implementations of bli_l3_sup_thread_decorator() are
      merely carbon-copies of the implementation provided for single-
      threaded execution (i.e., the one found in bli_l3_sup_decor_single.c).
      Thus, an OpenMP or pthreads build will be able to use the gemmsup
      code (including the new selective packing functionality), as it did
      before 39fa7136, even though it will not actually employ any
      multithreaded parallelism.

commit 39fa7136f4a4e55ccd9796fb79ad5f121b872ad9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Nov 29 15:27:07 2019 -0600

    Added support for selective packing to gemmsup.
    
    Details:
    - Implemented optional packing for A or B (or both) within the sup
      framework (which currently only supports gemm). The request for
      packing either matrix A or matrix B can be made via setting
      environment variables BLIS_PACK_A or BLIS_PACK_B (to any
      non-zero value; if set, zero means "disable packing"). It can also
      be made globally at runtime via bli_pack_set_pack_a() and
      bli_pack_set_pack_b() or with individual rntm_t objects via
      bli_rntm_set_pack_a() and bli_rntm_set_pack_b() if using the expert
      interface of either the BLIS typed or object APIs. (If using the
      BLAS API, environment variables are the only way to communicate the
      packing request.)
    - One caveat (for now) with the current implementation of selective
      packing is that any blocksize extension registered in the _cntx_init
      function (such as is currently used by haswell and zen subconfigs)
      will be ignored if the affected matrix is packed. The reason is
      simply that I didn't get around to implementing the necessary logic
      to pack a larger edge-case micropanel, though this is entirely
      possible and should be done in the future.
    - Spun off the variant-choosing portion of bli_gemmsup_ref() into
      bli_gemmsup_int(), in bli_l3_sup_int.c.
    - Added new files, bli_l3_sup_packm_a.c, bli_l3_sup_packm_b.c, along
      with corresponding headers, in which higher-level packm-related
      functions are defined for use within the sup framework. The actual
      packm variant code resides in bli_l3_sup_packm_var.c.
    - Pass the following new parameters into var1n and var2m: packa, packb
      bool_t's, pointer to a rntm_t, pointer to a cntl_t (which is for now
      always NULL), and pointer to a thrinfo_t* (which for now is the address
      of the global single-threaded packm thread control node).
    - Added panel strides ps_a and ps_b to the auxinfo_t structure so that
      the millikernel can query the panel stride of the packed matrix and
      step through it accordingly. If the matrix isn't packed, the panel
      stride of interest for the given millikernel will be set to the
      appropriate value so that the mkernel may step through the unpacked
      matrix as it normally would.
    - Modified the rv_6x8m and rv_6x8n millikernels to read the appropriate
      panel strides (ps_a and ps_b, respectively) instead of computing them
      on the fly.
    - Spun off the environment variable getting and setting functions into
      a new file, bli_env.c (with a corresponding prototype header). These
      functions are now used by the threading infrastructure (e.g.
      BLIS_NUM_THREADS, BLIS_JC_NT, etc.) as well as the selective packing
      infrastructure (e.g. BLIS_PACK_A, BLIS_PACK_B).
    - Added a static initializer for mem_t objects, BLIS_MEM_INITIALIZER.
    - Added a static initializer for pblk_t objects, BLIS_PBLK_INITIALIZER,
      for use within the definition of BLIS_MEM_INITIALIZER.
    - Moved the global_rntm object to bli_rntm.c and extern it where needed.
      This means that the function bli_thread_init_rntm() was renamed to
      bli_rntm_init_from_global() and relocated accordingly.
    - Added a new file, bli_pack.c, which serves as the home for
      functions that manage the pack_a and pack_b fields of the global
      rntm_t, including from environment variables, just as we have
      functions to manage the threading fields of the global rntm_t in
      bli_thread.c.
    - Reorganized naming for files in frame/thread, which mostly involved
      spinning off the bli_l3_thread_decorator() functions into their own
      files. This change makes more sense when considering the further
      addition of bli_l3_sup_thread_decorator() functions (for now limited
      only to the single-threaded form found in the _single.c file).
    - Explicitly initialize the reference sup handlers in both
      bli_cntx_init_haswell.c and bli_cntx_init_zen.c so that it's more
      obvious how to customize to a different handler, if desired.
    - Removed various snippets of disabled code.
    - Various comment updates.
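
    Editor's sketch of how an environment-variable packing request such as
    BLIS_PACK_A could be interpreted under the rule stated above (any
    non-zero value enables packing; zero disables it). The helper names
    parse_flag_sketch and env_flag_sketch are hypothetical, and treating a
    non-numeric value as "unset" is an assumption; the actual logic in
    bli_env.c may differ.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: interpret the text of an environment variable.
   NULL (unset) or non-numeric text yields 'fallback'; "0" yields 0;
   any other integer yields 1. */
int parse_flag_sketch(const char *s, int fallback)
{
    if (s == NULL) return fallback;

    char *end;
    long v = strtol(s, &end, 10);
    if (end == s) return fallback;  /* no digits parsed (assumed policy) */

    return v != 0;
}

/* Hypothetical wrapper that reads the variable from the environment. */
int env_flag_sketch(const char *name, int fallback)
{
    assert(name != NULL);
    return parse_flag_sketch(getenv(name), fallback);
}
```

    A caller would query env_flag_sketch( "BLIS_PACK_A", 0 ) once at
    initialization and store the result, rather than calling getenv() in
    the middle of a computation.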

commit bbb21fd0a9be8c5644bec37c75f9396eeeb69e48
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Nov 21 18:15:16 2019 -0600

    Tweaked SIAM/SC Best Prize language in README.md.

commit 043366f92d5f5f651d5e3371ac3adb36baf4adce
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Nov 21 18:13:51 2019 -0600

    Fixed typo in previous commit (SIAM/SC prize).

commit 05a4d583e65a46ff2a1100ab4433975d905d91f9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Nov 21 18:12:24 2019 -0600

    Added SIAM/SC prize to "What's New" in README.md.

commit 881b05ecd40c7bc0422d3479a02a28b1cb48383f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Nov 21 16:34:27 2019 -0600

    Fixed blastest failure for 'generic' subconfig.
    
    Details:
    - Fixed a subtle and complicated bug that only manifested via the BLAS
      test drivers in the generic subconfiguration, and possibly any other
      subconfiguration that did not register complex-domain gemm ukernels,
      or registered ONLY real-domain ukernels as row-preferential. This is
      a long story, but it boils down to an exception to the "transpose the
      operation to bring storage of C into agreement with ukernel pref"
      optimization in bli_hemm_front.c and bli_symm_front.c sabotaging the
      proper functioning of the 1m method, but only when the imaginary
      component of beta is zero. See the comments in issue #342 for more
      details. Thanks to Dave Love for identifying the commit in which this
      bug was introduced, and other feedback related to this bug.

commit 0c7165fb01cdebbc31ec00124d446161b289942f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Nov 14 16:48:14 2019 -0600

    Fixed obscure bug in bli_acquire_mpart_[mn]dim().
    
    Details:
    - Fixed a bug in bli_acquire_mpart_mdim(), bli_acquire_mpart_ndim(),
      and bli_acquire_mpart_mndim() that allowed the use of a blocksize b
      that is too large given the current row/column index (i.e., the i/j
      argument) and the size of the dimension being partitioned (i.e., the
      m/n argument). This bug only affected backwards partitioning/motion
      through the dimension and was the result of a misplaced conditional
      check-and-redirect to the backwards code path. It should be noted
      that this bug was not discovered through a failure within BLIS
      itself (the callers in BLIS always pass in the "correct" blocksize
      b), but it could have manifested if the functions were used by
      3rd party callers. Thanks to Minh Quan Ho for
      reporting the bug via issue #363.
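
    Editor's sketch of the invariant involved, under the assumption that
    the blocksize must be clamped to the remaining extent before the index
    is redirected for backward motion. The type part_range_sketch and the
    function acquire_range_sketch are hypothetical and are not the BLIS
    partitioning code.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { size_t offset; size_t length; } part_range_sketch;

/* Hypothetical helper. 'i' is the number of elements already consumed,
   'b' the requested blocksize, 'm' the size of the dimension. */
part_range_sketch acquire_range_sketch(int backward,
                                       size_t i, size_t b, size_t m)
{
    assert(i <= m);

    /* Clamp b to what remains; this must apply to BOTH directions. */
    if (b > m - i) b = m - i;

    part_range_sketch r;
    /* Only after clamping, translate the index for backward motion. */
    r.offset = backward ? m - i - b : i;
    r.length = b;
    return r;
}
```

    With m = 10, b = 4 and i = 8, both directions yield a block of length
    2; moving backward, that block starts at offset 0 rather than
    underflowing past the beginning of the dimension.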

commit fb8bef9982171ee0f60bc39e41a33c4d31fd59a9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Nov 14 13:05:28 2019 -0600

    Fixed copy-paste bug in bli_spackm_6xk_bb4_ref().
    
    Details:
    - Fixed a copy-paste bug in the new bli_spackm_6xk_bb4_ref() that
      manifested as failures in single-precision real level-3 operations.
      Also replaced the duplication factor constants with a const-qualified
      variable, dfac, so that this won't happen again.
    - Changed NC for single-precision real from 4080 to 8160 so that the
      packed matrix B will have the same byte footprint in both single
      and double real.

commit 8f399c89403d5824ba767df1426706cf2d19d0a7
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Nov 12 15:32:57 2019 -0600

    Tweaked/added notes to docs/Multithreading.md.
    
    Details:
    - Added language to docs/Multithreading.md cautioning the reader about
      the nuances of setting multithreading parameters via the manual and
      automatic ways simultaneously, and also about how these parameters
      behave when multithreading is disabled at configure-time. These
      changes are an attempt to address the issues that arose in issue #362.
      Thanks to Jérémie du Boisberranger for his feedback on this topic.
    - CREDITS file update.

commit bdc7ee3394500d8e5b626af6ff37c048398bb27e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 11 15:47:17 2019 -0600

    Various fixes to support packing duplication in B.
    
    Details:
    - Added cpp macros to trmm and trmm3 front-ends to optionally force
      those operations to be cast so the structured matrix is on the left.
      symm and hemm already had such macros, but these too were renamed so
      that the macros were individual to the operation. We now have four
      such macros:
        #define BLIS_DISABLE_HEMM_RIGHT
        #define BLIS_DISABLE_SYMM_RIGHT
        #define BLIS_DISABLE_TRMM_RIGHT
        #define BLIS_DISABLE_TRMM3_RIGHT
      Also, updated the comments in the symm and hemm front-ends related to
      the first two macro guards, and added corresponding comments to the
      trmm and trmm3 front-ends for the latter two guards. (They all
      functionally do the same thing, just for their specific operations.)
      Thanks to Jeff Hammond for reporting the bugs that led me to this
      change (via #359).
    - Updated config/old/haswellbb subconfiguration (used to debug issues
      related to duplicating B during packing) to register: a packing
      kernel for single-precision real; gemmbb ukernels for s, c, and z;
      trsmbb ukernels for s, c, and z; gemmtrsmbb virtual ukernels for s, c
      and z; and to use non-default cache and register blocksizes for s, c,
      and z datatypes. Also declared prototypes for all of the gemmbb,
      trsmbb, and gemmtrsmbb ukernel functions within the
      bli_cntx_init_haswellbb() function. This should, once applied to the
      power9 configuration, fix the remaining issues in #359.
    - Defined bli_spackm_6xk_bb4_ref(), which packs single reals with a
      duplication factor of 4. This function is defined in the same file as
      bli_dpackm_6xk_bb2_ref() (bli_packm_cxk_bb_ref.c).
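
    Editor's sketch of what "casting the operation so the structured
    matrix is on the left" means for symm. For symmetric A, C = B * A is
    equivalent to C^T = A * B^T, and with general (row, column) strides a
    transposition is just a swap of the two strides. The function names
    are hypothetical, and the naive loops assume both triangles of A are
    stored; the BLIS front-ends operate on obj_t objects instead.

```c
#include <assert.h>
#include <stddef.h>

/* C (m x n) := A (m x m, symmetric, fully stored) * B (m x n). */
void symm_left_sketch(size_t m, size_t n,
                      const double *a, size_t rs_a, size_t cs_a,
                      const double *b, size_t rs_b, size_t cs_b,
                      double *c, size_t rs_c, size_t cs_c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t p = 0; p < m; ++p)
                sum += a[i * rs_a + p * cs_a] * b[p * rs_b + j * cs_b];
            c[i * rs_c + j * cs_c] = sum;
        }
}

/* C (m x n) := B (m x n) * A (n x n, symmetric), expressed purely in
   terms of the left-side routine: swap the dimensions and swap the
   strides of B and C so that they are viewed as B^T and C^T. */
void symm_right_sketch(size_t m, size_t n,
                       const double *a, size_t rs_a, size_t cs_a,
                       const double *b, size_t rs_b, size_t cs_b,
                       double *c, size_t rs_c, size_t cs_c)
{
    symm_left_sketch(n, m,
                     a, rs_a, cs_a,
                     b, cs_b, rs_b,
                     c, cs_c, rs_c);
}
```

    Because the right-side case never reaches the packing code, only the
    non-structured matrix B ever needs to be packed with duplication.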

commit 0eb79ca8503bd7b237994335b9687457227d3290
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Nov 8 14:48:48 2019 -0600

    Avoid unused variable warning in lread.c (#356).
    
    Details:
    - Replaced the line
    
        f = f;
    
      with
    
        ( void )f;
    
      for the unused variable 'f' in blastest/f2c/lread.c. (Hopefully)
      addresses issue #356, but since we don't use xlc who knows. Thanks
      to Jeff Hammond for reporting this.

commit f377bb448512f0b578263387eed7eaf8f2b72bb7
Author: Jérôme Duval <jerome.duval@gmail.com>
Date:   Thu Nov 7 23:39:29 2019 +0100

    Add Haiku to the known OS list (#361)

commit e29b1f9706b6d9ed798b7f6325f275df4e6be973
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Nov 5 17:15:19 2019 -0600

    Fixed failing testsuite gemmtrsm_ukr for power9.
    
    Details:
    - Added code that fixes false failures in the gemmtrsm_ukr module of the
      testsuite. The tests were failing because the computation (bli_gemv())
      that performs the numerical check was not able to properly traverse
      the matrix operands bx1 and b11 that are views into the micropanel of
      B, which has duplicated/broadcast elements under the power9 subconfig.
      (For example, a micropanel of B with duplication factor of 2 needs to
      use a column stride of 2; previously, the column stride was being
      interpreted as 1.)
    - Defined separate bli_obj_set_row_stride() and bli_obj_set_col_stride()
      static functions in bli_obj_macro_defs.h. (Previously, only the
      function bli_obj_set_strides() was defined. Amazing to think that we
      got this far without these former functions.)
    - Updated/expounded upon comments.
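
    Editor's sketch of why the column stride matters when viewing a
    micropanel of B whose elements are duplicated. It assumes a layout in
    which each logical element is stored d times consecutively, so that a
    k x nr micropanel has row stride nr*d and column stride d. The
    accessor bb_elem_sketch is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical accessor: logical element (p, j) of a k x nr micropanel
   in which every element is stored d times in a row. */
double bb_elem_sketch(const double *bp, size_t nr, size_t d,
                      size_t p, size_t j)
{
    assert(bp != NULL && d > 0 && j < nr);

    size_t rs = nr * d;  /* row stride: one full row of duplicates   */
    size_t cs = d;       /* column stride: step over the duplicates  */
    return bp[p * rs + j * cs];
}
```

    With d = 2, reading the same micropanel with a column stride of 1
    would return the duplicate of column 0 where column 1 was expected,
    which is the kind of mismatch that produced the false failures.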

commit 49177a6b9afcccca5b39a21c6fd8e243525e1505
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 4 18:09:37 2019 -0600

    Fixed latent testsuite ukr module bugs for power9.
    
    Details:
    - Fixed a latent bug in the testsuite ukernel modules (gemm, trsm, and
      gemmtrsm) that only manifested once we began running with parameters
      that mimic those of power9. The problem was rooted in the way those
      modules were creating objects (and thus allocating memory) for the
      micropanel operands to the microkernel being tested. Since power9
      duplicates/broadcasts elements of B in memory, we needed an easy way
      of asking for more than one storage element per logical element in
      the matrix. I incorrectly expressed this as:
    
        bli_obj_create( datatype, k, n, ldbp, 1, &bp );
    
      The problem here is that bli_obj_create() is exceedingly efficient
      at calculating the size it passes to malloc() and doesn't allocate a
      full leading dimension's worth of elements for the last column (or
      row, in this example). This would normally not bother anyone since
      you're not supposed to access that memory anyway. But here, my
      attempted "hack" for getting extra elements was insufficient, and
      needed to be changed to:
    
        bli_obj_create( datatype, k, ldbp, ldbp, 1, &bp );
    
      That is, the extra elements needed to be baked into the dimensions of
      the matrix object in order to have the intended effect on the number
      of elements actually allocated. Thanks to Jeff Hammond for reporting
      this bug.
    - Fixed a typically harmless memory leak in the aforementioned test
      modules (the objects for the packed micropanels were not being freed).
    - Updated/expanded a common comment across all three ukr test modules.
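
    Editor's worked example of the allocation arithmetic behind the first
    fix. It assumes a model in which the allocator requests only enough
    elements to reach the last addressable element of an m x n matrix
    with row stride rs and column stride cs; the function
    min_elems_sketch is hypothetical and is not how bli_obj_create() is
    written.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: number of elements spanned by an m x n matrix,
   from element (0,0) through element (m-1,n-1) inclusive. */
size_t min_elems_sketch(size_t m, size_t n, size_t rs, size_t cs)
{
    assert(rs > 0 && cs > 0);
    if (m == 0 || n == 0) return 0;
    return (m - 1) * rs + (n - 1) * cs + 1;
}
```

    Taking k = 4, n = 6 and ldbp = 12 as illustrative values, a k x n
    object with strides (ldbp, 1) spans 3*12 + 5 + 1 = 42 elements, which
    is short of the k * ldbp = 48 elements the duplicated micropanel
    occupies. A k x ldbp object with the same strides spans
    3*12 + 11 + 1 = 48 elements, as intended.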

commit c84391314d4f1b3f73d868f72105324e649f2a72
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 4 13:57:12 2019 -0600

    Reverted minor temp/wspace changes from b426f9e.
    
    Details:
    - Added missing license header to bli_pwr9_asm_macros_12x6.h.
    - Reverted temporary changes to various files in 'test' and 'testsuite'
      directories.
    - Moved testsuite/jobscripts into testsuite/old.
    - Minor whitespace/comment changes across various files.

commit 4870260f6b8c06d2cc01b7147d7433ddee213f7f
Author: Jeff Hammond <jeff.r.hammond@intel.com>
Date:   Mon Nov 4 11:55:47 2019 -0800

    blacklist GCC 5 and older for POWER9 (#360)

commit b426f9e04e5499c6f9c752e49c33800bfaadda4c
Author: Nicholai Tukanov <nicholai@utexas.edu>
Date:   Fri Nov 1 17:57:03 2019 -0500

    POWER9 DGEMM  (#355)
    
    Implemented and registered power9 dgemm ukernel.
    
    Details:
    - Implemented 12x6 dgemm microkernel for power9. This microkernel
      assumes that elements of B have been duplicated/broadcast during the
      packing step. The microkernel uses a column orientation for its
      microtile vector registers and thus implements column storage and
      general stride IO cases. (A row storage IO case via in-register
      transposition may be added at a future date.) It should be noted that
      we recommend using this microkernel with gcc and *not* xlc, as issues
      with the latter cropped up during development, including but not
      limited to slightly incompatible vector register mnemonics in the GNU
      extended inline assembly clobber list.

commit 58102aeaa282dc79554ed045e1b17a6eda292e15
Merge: 52059506 b9bc222b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 28 17:58:31 2019 -0500

    Merge branch 'amd'

commit 52059506b2d5fd4c3738165195abeb356a134bd4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 23 15:26:42 2019 -0500

    Added "How to Download BLIS" section to README.md.
    
    Details:
    - Added a new section to the README.md, just prior to the "Getting
      Started" section, titled "How to Download BLIS". This section details
      the user's options for obtaining BLIS and lays out four common ways
      of downloading the library. Thanks to Jeff Diamond for his feedback
      on this topic.

commit e6f0a96cc59aef728470f6850947ba856148c38a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 14 17:05:39 2019 -0500

    Updated README.md to ack Facebook as funder.

commit b9bc222bfc3db4f9ae5d7b3321346eed70c2c3fb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 14 16:38:15 2019 -0500

    Call bli_syrk_small() before error checking.
    
    Details:
    - In bli_syrk_front(), moved the conditional call to bli_syrk_check()
      (if error checking is enabled) and the conditional scaling of C by
      beta (if alpha is zero) so that they occur after, instead of before,
      the call to bli_syrk_small(). This sequencing now matches that of
      bli_gemm_small() in bli_gemm_front() and bli_trsm_small() in
      bli_trsm_front().

commit f0959a81dbcf30d8a1076d0a6348a9835079d31a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 14 15:46:28 2019 -0500

    When manual config is blacklisted, output error.
    
    Details:
    - Fixed and adjusted the logic in configure so that a more informative
      error message is output when a user runs './configure ... <conf>' and
      <conf> is present in the configuration blacklist. Previously, this
      particular set of conditions would result in the message:
    
        'user-specified configuration '' is NOT registered!
    
      That is, the error message mis-identified the targeted configuration
      as the empty string, and (more importantly) mis-identifies the
      problem. Thanks to Tze Meng Low for reporting this issue.
    - Fixed a nearby error message somewhat unrelated to the issue above.
      Specifically, the wrong string was being printed when the error
      message was identifying an auto-detected configuration that did not
      appear to be registered.

commit 6218ac95a525eefa8921baf8d0d7057dfacebe9c
Merge: 0016d541 a617301f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 11 11:53:51 2019 -0500

    Merge branch 'master' into amd

commit 0016d541e6b0da617b1fae6612d2b314901b7a75
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 11 11:09:44 2019 -0500

    Changed -march=znver2 to =znver1 for clang on zen2.
    
    Details:
    - In config/zen2/make_defs.mk, changed the -march= flag so that
      -march=znver1 is used instead of -march=znver2 when CC_VENDOR is
      clang. (The gcc branch attempts to differentiate between various
      versions, but the equivalent version cutoffs for clang are not
      yet known by us, so we have to use a single flag for all versions
      of clang. Hopefully -march=znver1 is new enough. If not, we'll
      fall back to -march=bdver4 -mno-fma4 -mno-tbm -mno-xop -mno-lwp.)
      This issue was discovered thanks to AppVeyor.

commit e94a0530e5ac4c78a18f09105f40003be2b517f7
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 11 10:48:27 2019 -0500

    Corrected zen NC that was non-multiple of NR.
    
    Details:
    - Updated an incorrectly set cache blocksize NC for single real within
      config/zen/bli_cntx_init_zen.c that was not a multiple of the
      corresponding value of NR. This issue, which was caught by Travis CI,
      was introduced in 29b0e1e.

commit a2ffac752076bf55eb8c1fe2c5da8d9104f1f85b
Merge: 1cfe8e25 29b0e1ef
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 11 10:31:18 2019 -0500

    Merge branch 'amd-master' into amd

commit 29b0e1ef4e8b84ce76888d73c090009b361f1306
Merge: 1cfe8e25 fdce1a56
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 11 10:24:24 2019 -0500

    Code review + tweaks to AMD's AOCL 2.0 PR (#349).
    
    Details:
    - NOTE: This is a merge commit of 'master' of git://github.com/amd/blis
      into 'amd-master' of flame/blis.
    - Fixed a bug in the downstream value of BLIS_NUM_ARCHS, which was
      inadvertently not incremented when the Zen2 subconfiguration was
      added.
    - In bli_gemm_front(), added a missing conditional constraint around the
      call to bli_gemm_small() that ensures that the computation precision
      of C matches the storage precision of C.
    - In bli_syrk_front(), reorganized and relocated the notrans/trans logic
      that existed around the call to bli_syrk_small() into bli_syrk_small()
      to minimize the calling code footprint and also to bring that code
      into stylistic harmony with similar code in bli_gemm_front() and
      bli_trsm_front(). Also, replaced direct accessing of obj_t fields with
      proper accessor static functions (e.g. 'a->dim[0]' becomes
      'bli_obj_length( a )').
    - Added #ifdef BLIS_ENABLE_SMALL_MATRIX guard around prototypes for
      bli_gemm_small(), bli_syrk_small(), and bli_trsm_small(). This is
      strictly speaking unnecessary, but it serves as a useful visual cue to
      those who may be reading the files.
    - Removed cpp macro-protected small matrix debugging code from
      bli_trsm_front.c.
    - Added a GCC_OT_9_1_0 variable to build/config.mk.in to facilitate gcc
      version check for availability of -march=znver2, and added appropriate
      support to configure script.
    - Cleanups to compiler flags common to recent AMD microarchitectures in
      config/zen/amd_config.mk, including: removal of -march=znver1 et al.
      from CKVECFLAGS (since the -march flag is added within make_defs.mk);
      setting CRVECFLAGS similarly to CKVECFLAGS.
    - Cleanups to config/zen/bli_cntx_init_zen.c.
    - Cleanups, added comments to config/zen/make_defs.mk.
    - Cleanups to config/zen2/make_defs.mk, including making use of newly-
      added GCC_OT_9_1_0 and existing GCC_OT_6_1_0 to choose the correct
      set of compiler flags based on the version of gcc being used.
    - Reverted downstream changes to test/test_gemm.c.
    - Various whitespace/comment changes.

commit a617301f9365ac720ff286514105d1b78951368b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Oct 8 17:14:05 2019 -0500

    Updates to docs/CodingConventions.md.

commit 171f10069199f0cd280f18aac184546bd877c4fe
Merge: 702486b1 05d58edf
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 4 11:18:23 2019 -0500

    Merge remote-tracking branch 'loveshack/emacs'

commit 702486b12560b5c696ba06de9a73fc0d5107ca44
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 2 16:35:41 2019 -0500

    Removed stray FAQ section introduced in 1907000.

commit 1907000ad6ea396970c010f07ae42980b7b14fa0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 2 16:31:54 2019 -0500

    Updated to FAQ (AMD-related questions).
    
    Details:
    - Added a couple potential frequently-asked questions/answers related
      to AMD's fork of BLIS.
    - Updated existing answers to other questions.

commit 834f30a0dad808931c9d80bd5831b636ed0e1098
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 2 12:45:56 2019 -0500

    Mention mixeddt paper in docs/MixedDatatypes.md.

commit 05d58edfe0ea9279971d74f17a5f7a69c4672ed5
Author: Dave Love <dave.love@manchester.ac.uk>
Date:   Wed Oct 2 10:33:44 2019 +0100

    Note .dir-locals.el in docs

commit 531110c339f199a4d165d707c988d89ab4f5bfe8
Author: Dave Love <dave.love@manchester.ac.uk>
Date:   Wed Oct 2 10:16:22 2019 +0100

    Modify Emacs config
    Confine it to cc-mode and add comment-start/end.

commit 4bab365cab98202259c70feba6ec87408cba28d8
Author: Dave Love <dave.love@manchester.ac.uk>
Date:   Tue Oct 1 19:22:47 2019 +0000

    Add .dir-locals.el for Emacs (#348)
    
    A minimal version that could probably do with extending, but at least
    gets the indentation roughly right.

commit 4ec8dad66b3d37b0a2b47d19b7144bb62d332622
Author: Dave Love <dave.love@manchester.ac.uk>
Date:   Thu Sep 26 16:27:53 2019 +0100

    Add .dir-locals.el for Emacs
    
    A minimal version that could probably do with extending, but at least
    gets the indentation roughly right.

commit bc16ec7d1e2a30ce4a751255b70c9cbe87409e4f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 23 15:37:33 2019 -0500

    Set execute bits of shared library at install-time.
    
    Details:
    - Modified the 0644 octal code used during installation of shared
      libraries to 0755 (for Linux/OSX only). Thanks to Adam J. Stewart
      for reporting this issue via #343.
    - CREDITS file update.

commit c60db26aee9e7b4e5d0b031b0881e58d23666b53
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 17 18:04:17 2019 -0500

    Fixed bad loop counter in bli_[cz]scal2bbs_mxn().
    
    Details:
    - Fixed a typo in the loop counter for the 'd' (duplication) dimension
      in the complex macros of frame/include/level0/bb/bli_scal2bbs_mxn.h.
      They shouldn't be used by anyone yet, but thankfully clang via
      AppVeyor spit out warnings that alerted me to the issue.

commit c766c81d628f0451d8255bf5e4b8be0a4ef91978
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 17 18:00:29 2019 -0500

    Added missing schema arg to knl packm kernels.
    
    Details:
    - Added the pack_t schema argument to the knl packm kernel functions.
      This change was intended for inclusion in 31c8657. (Thank you SDE +
      Travis CI.)

commit 31c8657f1d6d8f6efd8a73fd1995e995fc56748b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 17 17:42:10 2019 -0500

    Added support for pre-broadcast when packing B.
    
    Details:
    - Added support for being able to duplicate (broadcast) elements in
      memory when packing matrix B (i.e., the right-hand operand) in level-3
      operations. This turns out advantageous for some architectures that
      can afford the cost of the extra bandwidth and somehow benefit from
      the pre-broadcast elements (and thus being able to avoid using
      broadcast-style load instructions on micro-rows of B in the gemm
      microkernel).
    - Support optionally disabling right-side hemm and symm. If this occurs,
      hemm_r is implemented in terms of hemm_l (and symm_r in terms of
      symm_l). This is needed when broadcasting during packing because the
      alternative--supporting the broadcast of B while also allowing matrix
      B to be Hermitian/symmetric--would be an absolute mess.
    - Support alignment factors for packed blocks of A, B, and C separately
      (as well as for general-purpose buffers). In addition, we support
      byte offsets from those alignment values (which is different from
      aligning by align+offset bytes to begin with). The default alignment
      values are BLIS_PAGE_SIZE in all four cases, with the offset values
      defaulting to zero.
    - Pass pack_t schema into bli_?packm_cxk() so that it can be then passed
      into the packm kernel, where it will be needed by packm kernels that
      perform broadcasts of B, since the idea is that we *only* want to
      broadcast when packing micropanels of B and not A.
    - Added definition for variadic bli_cntx_set_l3_vir_ukrs(), which can be
      used to set custom virtual level-3 microkernels in the cntx_t, which
      would typically be done in the bli_cntx_init_*() function defined in
      the subconfiguration of interest.
    - Added a "broadcast B" kernel function for use with NP/NR = 12/6,
      defined in ref_kernels/1m/bli_packm_cxk_bb_ref.c.
    - Added gemm, gemmtrsm, and trsm "broadcast B" reference kernels
      defined in ref_kernels/3/bb. (These kernels have been tested with
      double real with NP/NR = 12/6.)
    - Added #ifndef ... #endif guards around several macro constants defined
      in frame/include/bli_kernel_macro_defs.h.
    - Defined a few "broadcast B" static functions in
      frame/include/level0/bb for use by "broadcast B"-style packm reference
      kernels. For now, only the real domain kernels are tested and fully
      defined.
    - Output the alignment and offset values for packed blocks of A and B
      in the testsuite's "BLIS configuration info" section.
    - Comment updates to various files.
    - Bumped so_version to 3.0.0.
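
    Editor's sketch of packing one micropanel of B with pre-broadcast
    elements. The function pack_b_bb_sketch is hypothetical; it omits the
    scaling and edge-case zero-padding that a real packm kernel would
    typically handle, and it assumes each element is written d times
    consecutively.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical packing routine: copies a k x nr block of B (row stride
   rs_b, column stride cs_b) into p, storing every element d times in a
   row. The buffer p must hold at least k * nr * d elements. */
void pack_b_bb_sketch(size_t k, size_t nr, size_t d,
                      const double *b, size_t rs_b, size_t cs_b,
                      double *p)
{
    assert(b != NULL && p != NULL && d > 0);

    for (size_t l = 0; l < k; ++l)
        for (size_t j = 0; j < nr; ++j)
            for (size_t r = 0; r < d; ++r)
                p[(l * nr + j) * d + r] = b[l * rs_b + j * cs_b];
}
```

    The packed buffer is d times larger than an ordinary micropanel,
    which is the extra bandwidth cost mentioned above; in exchange, a
    microkernel can use plain vector loads on B instead of
    broadcast-style loads.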

commit fd9bf497cd4ff73ccdfc030ba037b3cb2f1c2fad
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 17 15:45:24 2019 -0500

    CREDITS file update.

commit 6c8f2d1486ce31ad3c2083e5c2035acfd4409a43
Author: ShmuelLevine <shmuel.levine@gmail.com>
Date:   Tue Sep 17 16:43:46 2019 -0400

    Fix description for function bli_*pxby2v (#340)
    
    Fix typo in BLISTypedAPI.md for bli_?axpy2v() description.

commit b5679c1520f8ae7637b3cc2313133461f62398dc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 17 14:00:37 2019 -0500

    Inserted Multithreading links into BuildSystem.md.
    
    Details:
    - Inserted brief disclaimers about default disabled multithreading
      and default single-threadedness to BuildSystem.md along with links to
      the Multithreading.md document. Thanks to Jeff Diamond for suggesting
      these additions.
    - Trivial reword of sentence regarding automatically-detected
      architectures.

commit f4f5170f8482c94132832eb3033bc8796da5420b
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Wed Sep 11 07:34:48 2019 -0500

    Update README.md (#338)

commit 1cfe8e2562e5e50769468382626ce36b734741c1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Sep 5 16:08:30 2019 -0500

    Reimplemented bli_cpuid_query() for ARM.
    
    Details:
    - Rewrote bli_cpuid_query() for ARM architectures to use stdio-based
      functions such as fopen() and fgets() instead of popen(). The new code
      does more or less the same thing as before--searches /proc/cpuinfo for
      various strings, which are then parsed in order to determine the
      model, part number, and features. Thanks to Dave Love for suggesting
      this change in issue #335.

commit 7c7819145740e96929466a248d6375d40e397e19
Author: Devin Matthews <damatthews@smu.edu>
Date:   Fri Aug 30 16:52:09 2019 -0500

    Always use sqsumv to compute normfv. (#334)
    
    * Always use sqsumv to compute normfv on MacOS.
    
    * Unconditionally disable the "dot trick" in normfv.
    
    * Added explanatory comment to normfv definition.
    
    Details:
    - Added a comment above the unconditional disabling of the dotv-based
      implementation to normfv. Thanks to Roman Yurchak, Devin Matthews,
      and Isuru Fernando for helping with this improvement.
    - CREDITS file update.

commit 80e6c10b72d50863b4b64d79f784df7befedfcd1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 29 12:12:08 2019 -0500

    Added reproduction section to Performance docs.
    
    Details:
    - Added section titled "Reproduction" to both Performance.md and
      PerformanceSmall.md that briefly nudges the motivated reader in the
      right direction if he/she wishes to run the same performance
      benchmarks used to produce the graphs shown in those documents.
      Thanks to Dave Love for making this suggestion.

commit 14cb426414856024b9ae0f84ac21efcc1d329467
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 28 17:04:33 2019 -0500

    Updated OpenBLAS, Eigen sup results.
    
    Details:
    - Updated the results shown in docs/PerformanceSmall.md for OpenBLAS and
      Eigen.

commit b02e0aae8ce2705e91023b98ed416cd05430a78e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 27 14:37:46 2019 -0500

    Updated test drivers to iterate backwards.
    
    Details:
    - Updated test driver source in test, test/3, test/1m4m, and
      test/mixeddt to iterate through the problem space backwards. This
      can help avoid certain situations where the CPU frequency does not
      immediately throttle up to its maximum. Thanks to Robert van de
      Geijn for recommending this fix (originally made to test/sup drivers
      in 57e422a).
    - Applied off-by-one matlab output bugfix from b6017e5 to test drivers
      in test, test/3, test/1m4m, and test/mixeddt directories.

commit b6017e53f4b26c99b14cdaa408351f11322b1e80
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 27 14:18:14 2019 -0500

    Bugfix of output text + tweaks to test/sup driver.
    
    Details:
    - Fixed an off-by-one bug in the output of matlab row indices in
      test/sup/test_gemm.c that only manifested when the problem size
      increment was equal to 1.
    - Disabled the building of rrc, rcr, rcc, crr, crc, and ccr storage
      combinations for blissup drivers in test/sup. This helps make the
      building of drivers complete sooner.
    - Trivial changes to test/sup/runme.sh.

commit 138d403b6bb15e687a3fe26d3d967b8ccd1ed97b
Author: Devin Matthews <damatthews@smu.edu>
Date:   Mon Aug 26 18:11:27 2019 -0500

    Use -funsafe-math-optimizations and -ffp-contract=fast for all reference kernels when using gcc or clang. (#331)

commit d5a05a15a7fcc38fb2519031dcc62de8ea4a530c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Aug 26 16:54:31 2019 -0500

    Cropped whitespace from new sup graphs.
    
    Details:
    - Previously forgot to crop whitespace from the new .png graphs
      added/updated in docs/graphs/sup.

commit a6c80171a353db709e43f9e6e7a3da87ce4d17ed
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Aug 26 16:51:31 2019 -0500

    Fixed contents links in docs/PerformanceSmall.md.
    
    Details:
    - Corrected links in contents section of docs/PerformanceSmall.md,
      which were erroneously directing readers to the corresponding
      sections of docs/Performance.md.

commit 40781774df56a912144ef19cc191ed626a89f0de
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Aug 26 16:47:37 2019 -0500

    Updated sup performance graphs with libxsmm.
    
    Details:
    - Added libxsmm to column-stored sup graphs presented in
      docs/PerformanceSmall.md.
    - Updated sup results for BLASFEO.
    - Added sup results for Lonestar5 (Haswell).
    - Addresses issue #326.

commit bfddf671328e7e372ac7228f72ff2d9d8e03ae18
Author: figual <figual@ucm.es>
Date:   Mon Aug 26 12:01:33 2019 +0200

    Fixed context registration for Cortex A53 (#329).

commit 4a0a6e89c568246d14de4cc30e3ff35aac23d774
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Aug 24 15:25:16 2019 -0500

    Changed test/sup alpha to 1; test libxsmm+netlib.
    
    Details:
    - Changed the value of alpha to 1.0 in test/sup/test_gemm.c. This is
      needed because libxsmm currently only optimizes gemm operations where
      alpha is unit (and beta is unit or zero).
    - Adjusted the test/sup/Makefile to test libxsmm with netlib BLAS as its
      fallback library. This is the library that will be called when the
      problem dimensions are deemed too large, or any other criteria for
      optimization are not met. (This was done not because it is realistic,
      but rather so that it would be very clear when libxsmm ceased handling
      gemm calls internally when the data are graphed.)

commit 7aa52b57832176c5c13a48e30a282e09ecdabf73
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Aug 23 16:12:50 2019 -0500

    Use libxsmm API in test/sup; add missing -ldl.
    
    Details:
    - Switch the driver source in test/sup so that libxsmm_?gemm() is called
      instead of ?gemm_() when compiling for / linking against libxsmm.
      libxsmm's documentation isn't clear on whether it is even *trying* to
      provide BLAS API compatibility, and I got tired of trying to figure it
      out.
    - Added missing -ldl in LDFLAGS when linking against libxsmm.

commit 57e422aa168bee7416965265c93fcd4934cd7041
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Aug 23 14:17:52 2019 -0500

    Added libxsmm support to test/sup drivers.
    
    Details:
    - Modified test/sup/Makefile to build drivers that test the performance
      of skinny/small problems via libxsmm.
    - Modified test/sup/runme.sh to run aforementioned drivers.
    - Modified test/sup/test_gemm.c so that problem sizes are tested in
      reverse order (from largest to smallest). This can help avoid certain
      situations where the CPU frequency does not immediately throttle up
      to its maximum. Thanks to Robert van de Geijn for recommending this
      fix.

commit 661681fe33978acce370255815c76348f83632bc
Merge: 2f387e32 ef0a1a0f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 22 14:29:50 2019 -0500

    Merge branch 'master' of github.com:flame/blis

commit 2f387e32ef5f9a17bafb5076dc9f66c38b52b32d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 22 14:27:30 2019 -0500

    Added Eigen -march=native hack to perf docs.
    
    Details:
    - Spell out the hack given to me by Sameer Agarwal in order to get Eigen
      to build with -march=native (which is critically important for Eigen)
      in docs/Performance.md and docs/PerformanceSmall.md.

commit ef0a1a0faf683fe205f85308a54a77ffd68a9a6c
Author: Devin Matthews <damatthews@smu.edu>
Date:   Wed Aug 21 17:40:24 2019 -0500

    Update do_sde.sh (#330)
    
    * Update do_sde.sh
    
    Automatically accept SDE license and download directly from Intel
    
    * Update .travis.yml
    
    [ci skip]
    
    * Update .travis.yml
    
    Enable SDE testing for PRs.

commit 0cd383d53a8c4a6871892a0395591ef5630d4ac0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 21 13:39:05 2019 -0500

    Corrected variable type and comment update.
    
    Details:
    - Forgot to save all changes from bli_gemmtrsm4m1_ref.c before commit
      in 8122f59. Fixed type mismatch and referenced github issue in
      comment.

commit 8122f59745db780987da6aa1e851e9e76aa985e0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 21 13:22:12 2019 -0500

    Pacify 'restrict' warning in gemmtrsm4m1 ref ukr.
    
    Details:
    - Previously, some versions of gcc would complain that the same
      pointer, one_r, is being passed in for both alpha and beta in the
      fourth call to the real gemm ukernel in bli_gemmtrsm4m1_ref.c. This
      is understandable since the compiler knows that the real gemm ukernel
      qualifies all of its floating-point arguments (including alpha and
      beta) with restrict. A small hack has been inserted into the file
      that defines a new variable to store the value 1.0, which is now used
      in lieu of one_r for beta in the fourth call to the real gemm ukernel,
      which should pacify the compiler now. Thanks to Dave Love for
      reporting this issue (#328) and to Devin Matthews for offering his
      'restrict' expertise.

commit e8c6281f139bdfc9bd68c3b36e5e89059b0ead2e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 21 12:38:53 2019 -0500

    Add -march support for specific gcc version ranges.
    
    Details:
    - Added logic to configure that checks the version of the compiler
      against known version ranges that could cause problems later in the
      build process. For example, versions of gcc older than 4.9.0 use
      different -march labels than version 4.9.0 or later
      ('-march=corei7-avx' vs '-march=sandybridge', respectively).
      Similarly, before 6.1, compilation on Zen was possible, but you
      needed to start with -march=bdver4 and then disable instruction sets
      that were discarded during the transition from Excavator to Zen. So
      now, configure substitutes 'yes'/'no' values into anchors in
      config.mk.in, which sets various make variables (e.g. GCC_OT_4_9_0),
      which can be accessed and branched upon by the various
      configurations' make_defs.mk files when setting their compiler flags.
    - Updated config/haswell/make_defs.mk to branch on GCC_OT_4_9_0.
    - Updated config/sandybridge/make_defs.mk to branch on GCC_OT_4_9_0.
    - Updated config/zen/make_defs.mk to branch on GCC_OT_6_1_0.

commit e6ac4ebcb6e6a372820e7f509c0af3342966b84a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 20 13:49:47 2019 -0500

    Added page size, source location to perf docs.
    
    Details:
    - Added the page size, as returned via 'getconf -a | grep PAGE_SIZE',
      and the location of the performance drivers to docs/Performance.md
      (test/3) and docs/PerformanceSmall.md (test/sup). Thanks to Dave
      Love for suggesting these additions in #325.

commit fdce1a5648d69034fab39943100289323011c36f
Author: Meghana <Meghana.Vankadari@amd.com>
Date:   Wed Jul 24 15:04:41 2019 +0530

    changed gcc version check condition from 'ifeq' to 'if greater or equal'
    
    Change-Id: Ie4c461867829bcc113210791bbefb9517e52c226

commit c9486e0c4f82cd9f58f5ceb71c0df039e9970a20
Author: Meghana <Meghana.Vankadari@amd.com>
Date:   Wed Jul 24 09:45:17 2019 +0530

    code to detect version of gcc and set flags accordingly for zen2
    
    Change-Id: I29b0311d0000dee1a2533ee29941acf53f9e9f34

commit 54afe3dfe6828a1aff65baabbf14c98d92e50692
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 23 16:54:28 2019 -0500

    Added "Education and Learning" ToC entry to README.

commit 9f53b1ce7ac702e84e71801fe96986f6aa16040e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 23 16:50:35 2019 -0500

    Added "Education and Learning" section to README.
    
    Details:
    - Added a short section after the Intro of the README.md file titled
      "Education and Learning" that directs interested readers to the
      "LAFF-On Programming for High-Performance" massive open online course
      (MOOC) hosted via edX.

commit deda4ca8a094ee18d7c7c45e040e8ef180f33a48
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 22 13:59:05 2019 -0500

    Added test/1m4m driver directory.
    
    Details:
    - Added a new standalone test driver directory named '1m4m' that can
      build and run performance experiments for BLIS 1m, 4m1a, assembly,
      OpenBLAS, and the vendor library (MKL). This new driver directory
      was used to regenerate performance results for the 1m paper.
    - Added alternate (commented-out) cache blocksizes to
      config/haswell/bli_cntx_init_haswell.c. These blocksizes tend to
      work well on a 12-core Intel Xeon E5-2650 v3.

commit dcc0ce12fde4c6dca2b4764a1922a2ab19725867
Author: Meghana <Meghana.Vankadari@amd.com>
Date:   Mon Jul 22 17:12:01 2019 +0530

    Added a global Makefile for AMD architectures in config/zen folder
    This Makefile (amd_config.mk) has all the flags that are common to EPYC series
    
    Change-Id: Ic02c60a8293ccdd37f0f292e631acd198e6895de

commit af17bca26a8bd3dcbee8ca81c18d7b25de09c483
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 19 14:46:23 2019 -0500

    Updated haswell MC cache blocksizes.
    
    Details:
    - Updated the default MC cache blocksizes used by the haswell subconfig
      for both row-preferential (the default) and column-preferential
      microkernels.

commit b5e9bce4dde5bf014dd9771ae741048e1f6c7748
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 19 14:42:37 2019 -0500

    Updated -march flags for sandybridge, haswell.
    
    Details:
    - Updated the '-march=corei7-avx' flag in the sandybridge subconfig
      to '-march=sandybridge' and the '-march=core-avx2' flag in the
      haswell subconfig to '-march=haswell'. The older flags were used
      by older versions of gcc and should have been updated to the newer
      forms a long time ago. (The older flags were clearly working, even
      though they are no longer documented in the gcc man page.)

commit c22b9dba5859a9fc94c8431eccc9e4eb9be02be1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 16 13:14:47 2019 -0500

    More updates to comments in testsuite modules.
    
    Details:
    - Updated most comments in testsuite modules that describe how the
      correctness test is performed so that it is clear whether the vector
      (normfv) or matrix (normfm) form of Frobenius norm is used.

commit c4cc6fa702f444a05963db01db51bc7d6669e979
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 16 13:00:35 2019 -0500

    New cntx_t blksz "set" functions + misc tweaks.
    
    Details:
    - Defined two new static functions in bli_cntx.h:
        bli_cntx_set_blksz_def_dt()
        bli_cntx_set_blksz_max_dt()
      which developers may find convenient when experimenting with different
      values of cache blocksizes.
    - Updated one- and two-socket multithreaded problem size range and
      increment values in test/3/Makefile.
    - Changed default to column storage in test/3/test_gemm.c.
    - Fixed typo in comment in testsuite/src/test_subm.c.

commit b84cee29f42855dc1f263e42b83b1a46ac8def87
Merge: 1f80858a c7dd6e6c
Author: Meghana Vankadari <Meghana.Vankadari@amd.com>
Date:   Mon Jul 8 02:03:07 2019 -0400

    Merge "Added compiler flags for vanilla clang" into amd-staging-rome2.0

commit 1f80858abf5ca220b2998fbe6f9b06c32d3864c3
Author: kdevraje <kiran.Devrajegowda@amd.com>
Date:   Fri Jul 5 16:05:11 2019 +0530

    Fixed the dgemm performance issue (Jira ticket CPUPL 458): an #else was missed during integration, so the code always followed the else path to get the block sizes
    
    Change-Id: I0084b5856c2513ab1066c08c15b5086db6532717

commit c7dd6e6cd2f910cbefcdc1e04a5adeb919a23de0
Author: Meghana <meghana.vankadari@amd.com>
Date:   Thu Jul 4 09:32:51 2019 +0530

    Added compiler flags for vanilla clang
    
    Change-Id: I13c00b4c0d65bbda4c929848fd48b0ab611952ab

commit 2acd49b76457635625a01e31c2abc8902b23cf51
Author: Meghana <meghana.vankadari@amd.com>
Date:   Mon Jul 1 15:42:38 2019 +0530

    fix for test failures using AOCC 2.0
    
    Change-Id: If44eaccc64bbe96bbbe1d32279b1b5773aba08d1

commit ceee2f973ebe115beca55ca77f9e3ce36b14c28a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 24 17:47:40 2019 -0500

    Fixed thrinfo_t printing bug for small problems.
    
    Details:
    - Fixed a bug in bli_l3_thrinfo_print_gemm_paths() and
      bli_l3_thrinfo_print_trsm_paths(), defined in bli_l3_thrinfo.c,
      whereby subnodes of the thrinfo_t tree are "dereferenced" near the
      beginning of the functions, which may lead to segfaults in certain
      situations where the thread tree was not fully formed because the
      matrix problem was too small for the level of parallelism specified.
      (That is, too small because some threads were assigned no work due
      to the smallest units in the m and n dimensions being defined by the
      register blocksizes mr and nr.) The fix requires several nested levels
      of if statements, and this is one of those few instances where use of
      goto statements results in (mostly) prettier code, especially in the
      case of _gemm_paths(). And while it wasn't necessary, I ported this
      goto usage to the loop body that prints the thrinfo_t work_id and
      comm_id values for each thread. Thanks to Nicholai Tukanov for helping
      to find this bug.

commit cac127182dd88ed0394ad81e6b91b897198e168a
Merge: 565fa385 3a45ecb1
Author: kdevraje <Kiran.Devrajegowda@amd.com>
Date:   Mon Jun 24 13:01:27 2019 +0530

    Merge branch 'amd-staging-rome2.0' of ssh://git.amd.com:29418/cpulibraries/er/blis
    with public repo commit id 565fa3853b381051ac92cff764625909d105644d.
    
    Change-Id: I68b9824b110cf14df248217a24a6191b3df79d42

commit c152109e9a3b1cd74760e8a3215a676d25c18d2e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jun 19 13:23:24 2019 -0500

    Updated BLASFEO results in PerformanceSmall.md.
    
    Details:
    - Updated the BLASFEO performance graphs shown in PerformanceSmall.md
      using a new commit of BLASFEO (2c9f312); updated PerformanceSmall.md
      accordingly.
    - Updated test/sup/octave/plot_l3sup_perf.m so that the .m files
      containing the mpnpkp results do not need to be preprocessed in order
      to plot half the problem size range (ie: up to 400 instead of the
      800 range of the other shape cases).
    - Trivial updates to runme.m.

commit 4d19c98110691d33ecef09d7e1b97bd1ccf4c420
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jun 8 11:02:03 2019 -0500

    Trivial change to MixedDatatypes.md link text.

commit 24965beabe83e19acf62008366097a7f198d4841
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jun 8 11:00:22 2019 -0500

    Fixed typo in README.md's MixedDatatypes.md link.

commit 50dc5d95760f41c5117c46f754245edc642b2179
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jun 7 13:10:16 2019 -0500

    Adjust -fopenmp-simd for icc's preferred syntax.
    
    Details:
    - Use -qopenmp-simd instead of -fopenmp-simd when compiling with Intel
      icc. Recall that this option is used for SIMD auto-vectorization in
      reference kernels only. Support for the -f option has been completely
      deprecated and removed in newer versions of icc in favor of -q. Thanks
      to Victor Eijkhout for reporting this issue and suggesting the fix.

commit ad937db9507786874c801b41a4992aef42d924a1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jun 7 11:34:08 2019 -0500

    Added missing #include "bli_family_thunderx2.h".
    
    Details:
    - Added a cpp-conditional directive block to bli_arch_config.h that
      #includes "bli_family_thunderx2.h". The code has been missing since
      adf5c17f. However, this never manifested as an error because the file
      is virtually empty and not needed for thunderx2 (or most subconfigs).
      Thanks to Jeff Diamond for helping to spot this.

commit ce671917b2bc24895289247feef46f6fdd5020e7
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 6 14:17:21 2019 -0500

    Fixed formatting/typo in docs/PerformanceSmall.md.

commit 86c33a4eb284e2cf3282a1809be377785cdb3703
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jun 5 11:43:55 2019 -0500

    Tweaked language in README.md related to sup/AMD.

commit cbaa22e1ca368d36a8510f2b4ecd6f1523d1e1f3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jun 4 16:06:58 2019 -0500

    Added BLASFEO results to docs/PerformanceSmall.md.
    
    Details:
    - Updated the graphs linked in PerformanceSmall.md with BLASFEO results,
      and added documenting language accordingly.
    - Updated scripts in test/sup/octave to plot BLASFEO data.
    - Minor tweak to language re: how OpenBLAS was configured for
      docs/Performance.md.

commit 763fa39c3088c0e2c0155675a3ca868a58bffb30
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jun 4 14:46:45 2019 -0500

    Minor tweaks to test/sup.
    
    Details:
    - Changed starting problem and increment from 16 to 4.
    - Added 'lll' (square problems) to list of problem size shapes to
      compile and run with.
    - Define BLASFEO location and added BLASFEO-related definitions.

commit 5e1e696003c9151b1879b910a1957b7bdd7b0deb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 3 18:37:20 2019 -0500

    CHANGELOG update (0.6.0)

commit 18c876b989fd0dcaa27becd14e4f16bdac7e89b3 (tag: 0.6.0)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 3 18:37:19 2019 -0500

    Version file update (0.6.0)

commit 0f1b3bf49eb593ca7bb08b68a7209f7cd550f912
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 3 18:35:19 2019 -0500

    ReleaseNotes.md update in advance of next version.
    
    Details:
    - Updated ReleaseNotes.md in preparation for next version.
    - CREDITS file update.

commit 27da2e8400d900855da0d834b5417d7e83f21de1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 3 17:14:56 2019 -0500

    Minor edits to docs/PerformanceSmall.md.
    
    Details:
    - Added performance analysis to "Comments" section of both Kaby Lake and
      Epyc sections.
    - Added emphasis to certain passages.

commit 09ba05c6f87efbaadf085497dc137845f16ee9c5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 3 16:53:19 2019 -0500

    Added sup performance graphs/document to 'docs'.
    
    Details:
    - Added a new markdown document, docs/PerformanceSmall.md, which
      publishes new performance graphs for Kaby Lake and Epyc showcasing
      the new BLIS sup (small/skinny/unpacked) framework logic and kernels.
      For now, only single-threaded dgemm performance is shown.
    - Reorganized graphs in docs/graphs into docs/graphs/large, with new
      graphs being placed in docs/graphs/sup.
    - Updates to scripts in test/sup/octave, mostly to allow decent output
      in both GNU octave and Matlab.
    - Updated README.md to mention and refer to the new PerformanceSmall.md
      document.

commit 6bf449cc6941734748034de0e9af22b75f1d6ba1
Merge: abd8a9fa a4e8801d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri May 31 17:42:40 2019 -0500

    Merge branch 'amd'

commit a4e8801d08d81fa42ebea6a05a990de8dcedc803
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri May 31 17:30:51 2019 -0500

    Increased MT sup threshold for double to 201.
    
    Details:
    - Fine-tuned the double-precision real MT threshold (which controls
      whether the sup implementation kicks in for smaller m dimension values)
      from 180 to 201 for haswell and 180 to 256 for zen.
    - Updated octave scripts in test/sup/octave to include a seventh column
      to display performance for m = n = k.

commit 3a45ecb15456249c30ccccd60e42152f355615c1
Merge: 3f867c96 b69fb0b7
Author: Kiran Devrajegowda <Kiran.Devrajegowda@amd.com>
Date:   Fri May 31 06:47:02 2019 -0400

    Merge "Added back BLIS_ENABLE_ZEN_BLOCK_SIZES macro to zen configuration, this is same as release 1.3. This was added before to improve DGEMM Multithreaded scalability on Naples for when number of threads is greater than 16. By mistake this got deleted in many changes done for 2.0 release, now we are adding this change back., in bli_gemm_front.c - code cleanup" into amd-staging-rome2.0

commit b69fb0b74a4756168de270fc9b18f7cf7aa57f17
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Fri May 31 15:14:22 2019 +0530

    Added back BLIS_ENABLE_ZEN_BLOCK_SIZES macro to zen configuration, this is same as release 1.3. This was added before to improve DGEMM Multithreaded scalability on Naples for when number of threads is greater than 16. By mistake this got deleted in many changes done for 2.0 release, now we are adding this change back., in bli_gemm_front.c - code cleanup
    
    Change-Id: I9f5d8225254676a99c6f2b09a0825e545206d0fc

commit 3f867c96caea3bbbbeeff1995d90f6cf8c9895fb
Author: kdevraje <Kiran.Devrajegowda@amd.com>
Date:   Fri May 31 12:22:44 2019 +0530

    When running HPL with pure MPI without DGEMM Threading (Single Threaded BLIS), making this macro 1 gives best performance.
    
    Change-Id: I24fd0bf99216f315e49f1c74c44c3feaffd7078d

commit abd8a9fa7df4569aa2711964c19888b8e248901f (origin/pfhp)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 28 12:49:44 2019 -0500

    Inadvertently hidden xerbla_() in blastest (#313).
    
    Details:
    - Attempted a fix to issue #313, which reports that when building only
      a shared library (ie: static library build is disabled), running the
      BLAS test drivers can fail because those drivers provide their own
      local version of xerbla_() as a clever (albeit still rather hackish)
      way of checking the error codes that result from the individual tests.
      This local xerbla_() function is never found at link-time because the
      BLAS test drivers' Makefile imports BLIS compilation flags via the
      get-user-cflags-for() function, which currently conveys the
      -fvisibility=hidden flag, which hides symbols unless they are
      explicitly annotated for export. The -fvisibility=hidden flag was
      only ever intended for use when building BLIS (not for applications),
      and so the attempted solution here is to omit the symbol export
      flag(s) from get-user-cflags-for() by storing the symbol export
      flag(s) to a new BUILD_SYMFLAGS variable instead of appending it
      to the subconfigurations' CMISCFLAGS variable (which is returned by
      every get-*-cflags-for() function). Thanks to M. Zhou for reporting
      this issue and also to Isuru Fernando for suggesting the fix.
    - Renamed BUILD_FLAGS to BUILD_CPPFLAGS to harmonize with the newly
      created BUILD_SYMFLAGS.
    - Fixed typo in entry for --export-shared flag in 'configure --help'
      text.

commit 13806ba3b01ca0dd341f4720fb930f97e46710b0
Author: kdevraje <Kiran.Devrajegowda@amd.com>
Date:   Mon May 27 16:24:43 2019 +0530

    Changed copyright information to (start year) - 2019
    
    Change-Id: Ide3c8f7172210b8d3538d3c36e88634ab1ba9041

commit ee123f535872510f77100d3d55a43d4ca56047d5
Author: Meghana <meghana.vankadari@amd.com>
Date:   Mon May 27 15:36:44 2019 +0530

    Defined small matrix thresholds for TRSM for various cases for NAPLES and ROME
    Updated copyright information for kernels/zen/bli_trsm_small.c file
    Removed separate kernels for zen2 architecture
    Instead added threshold conditions in zen kernels both for ROME and NAPLES
    
    Change-Id: Ifd715731741d649b6ad16b123a86dbd6665d97e5

commit 9d93a4caa21402d3a90aac45d7a1603736c9fd63
Author: prangana <pradeep.rao@amd.com>
Date:   Fri May 24 17:59:13 2019 +0530

    update version 2.0

commit 755730608d923538273a90c48bfdf77571f86519
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu May 23 17:34:36 2019 -0500

    Minor rewording of language around mt env. vars.

commit ba31abe73c97c16c78fffc59a215761b8d9fd1f6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu May 23 14:59:53 2019 -0500

    Added BLIS threading info to Performance.md.
    
    Details:
    - Documented the BLIS environment variables that were set
      (e.g. BLIS_JC_NT, BLIS_IC_NT, BLIS_JR_NT) for each machine and
      threading configuration in order to achieve the parallelism reported
      on in docs/Performance.md.

commit cb788ffc89cac03b44803620412a5e83450ca949
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu May 23 13:00:53 2019 -0500

    Increased MT sup threshold for double to 180.
    
    Details:
    - Increased the double-precision real MT threshold (which controls
      whether the sup implementation kicks in for smaller m dimension values)
      from 80 to 180, and this change was made for both haswell and zen
      subconfigurations. This is less about the m dimension in particular
      and more about facilitating a smoother performance transition when
      m = n = k.

commit 057f5f3d211e7513f457ee6ca6c9555d00ad1e57
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu May 23 12:51:17 2019 -0500

    Minor build system housekeeping.
    
    Details:
    - Commented out redundant setting of LIBBLIS_LINK within all driver-
      level Makefiles. This variable is already set within common.mk, and
      so the only time it should be overridden is if the user wants to link
      to a different copy of libblis.
    - Very minor changes to build/gen-make-frags/gen-make-frag.sh.
    - Whitespace and inconsequential quoting change to configure.
    - Moved top-level 'windows' directory into a new 'attic' directory.

commit e05171118c377f356f89c4daf8a0d5ddc5a4e4f7
Author: Meghana <meghana.vankadari@amd.com>
Date:   Thu May 23 16:15:27 2019 +0530

    Implemented TRSM for small matrices for cases where A is on the right
    
    Added separate kernels for zen and zen2
    
    Change-Id: I6318ddc250cf82516c1aa4732718a35eae0c9134

commit 02920f5c480c42706b487e37b5ecc96c3555b851
Author: kdevraje <Kiran.Devrajegowda@amd.com>
Date:   Thu May 23 15:29:59 2019 +0530

    make checkblis fails for matrix dimension check at the beginning, hence reverting it
    
    Change-Id: Ibd2ee8c2d4914598b72003fbfc5845be9c9c1e87

commit 84215022f29fb3bfedd254d041635308d177e6c0
Author: kdevraje <Kiran.Devrajegowda@amd.com>
Date:   Thu May 23 11:08:41 2019 +0530

     Adding threshold condition to dgemm small matrix kernels, defining the constants in zen2 configuration
    
    Change-Id: I53a58b5d734925a6fcb8d8bea5a02ddb8971fcd5

commit a3554eb1dcc1b5b94d81c60761b2f01c3d827ffa
Merge: ea082f83 17b878b6
Author: kdevraje <Kiran.Devrajegowda@amd.com>
Date:   Thu May 23 11:51:07 2019 +0530

    Merge branch 'amd-staging-rome2.0' of ssh://git.amd.com:29418/cpulibraries/er/blis to configure zen2
    
    Change-Id: I97e17bca9716b80b862925f97bb513c07b4b0cae

commit ea082f839071dd9ec555062dc3851c31d12f00e4
Author: kdevraje <Kiran.Devrajegowda@amd.com>
Date:   Thu May 23 10:38:29 2019 +0530

    adding empty zen2 directory with .gitignore file
    
    Change-Id: Ifa37cf54b2578aa19ad335372b44bca17043fe4b

commit b80bd5bcb2be8551a9a21fafc8e6c8b6336c99b5
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Tue May 21 15:11:47 2019 +0530

    config/zen/bli_cntx_init_zen.c: removed BLIS_ENABLE_ZEN_BLOCK_SIZES macro. We have different configurations for both zen and zen2
    config/zen/bli_family_zen.h: deleted macro BLIS_ENABLE_ZEN_BLOCK_SIZES
    config/zen/make_defs.mk: removed compiler flag -mno-avx256-split-unaligned-store
    frame/base/bli_cpuid.c: ROME family is 17H but model # is from 0x30H.
    test/test_gemm.c - commented out #define FILE_IN_OUT (some compilation error when BLIS is configured as amd64)
    Now we can use a single configuration as ./configure amd64 - this will work both for ROME & Naples
    
    Change-Id: I91b4fc35380f8a35b4f4c345da040c6b5910b4a2

commit a042db011df9a1c3e7c7ac546541f4746b176ea5
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Mon May 20 14:17:32 2019 +0530

    Modified make_defs.mk for zen2 to get compiled by gcc version less than gcc9.0
    
    Change-Id: I8fcac30538ee39534c296932639053b47b9a2d43

commit a23f92594cf3d530e5794307fe97afc877d853b7
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Mon May 20 10:48:06 2019 +0530

    config_registry: New AMD zen2 architecture configuration added.
      frame/base/bli_arch.c: #ifdef BLIS_FAMILY_ZEN2 id = BLIS_ARCH_ZEN2; #endif added. zen2 is added in config_name[BLIS_NUM_ARCHS]
      frame/base/bli_cpuid.c : #ifdef BLIS_CONFIG_ZEN2 if ( bli_cpuid_is_zen2( family, model, features ) ) return BLIS_ARCH_ZEN2; #endif, defined new function bool bli_cpuid_is_zen2(...).
      frame/base/bli_cpuid.h : declared bli_cpuid_is_zen2(..).
      frame/base/bli_gks.c : #ifdef BLIS_CONFIG_ZEN2 bli_gks_register_cntx(BLIS_ARCH_ZEN2, bli_cntx_init_zen2, bli_cntx_init_zen2_ref, bli_cntx_init_zen2_ind); #endif
      frame/include/bli_arch_config.h : #ifdef BLIS_CONFIG_ZEN2 CNTX_INIT_PROTS(zen2) #endif #ifdef BLIS_FAMILY_ZEN2 #include "bli_family_zen2.h" #endif
      frame/include/bli_type_defs.h : added BLIS_ARCH_ZEN2 in arch_t enum. BLIS_NUM_ARCHS 20
    
    Change-Id: I2a2d9b7266673e78a4f8543b1bfb5425b0aa7866

commit 17b878b66d917d50b6fe23721d8579e826cb3e8c
Author: kdevraje <Kiran.Devrajegowda@amd.com>
Date:   Wed May 22 14:02:53 2019 +0530

    adding license same as in ut-austin-amd-branch
    
    Change-Id: I6790768d2bf5d42369d304ef93e34701f95fbaff

commit df755848b8a271323e007c7a628c64af63deab00
Merge: ca4b33c0 c72ae27a
Author: kdevraje <Kiran.Devrajegowda@amd.com>
Date:   Wed May 22 13:30:07 2019 +0530

    Merge branch 'amd-staging-rome2.0' of ssh://git.amd.com:29418/cpulibraries/er/blis into rome2.0
    
    Change-Id: Ie8aad1ab810f0f3c0b90ec67f9dd3dfb8dcc74cc

commit c72ae27adee4726679ee004d02c972582b5285b4
Author: Nisanth M P <nisanth.padinharepatt@amd.com>
Date:   Mon Mar 19 12:49:26 2018 +0530

    Re-enabling the small matrix gemm optimization for target zen
    
    Change-Id: I13872784586984634d728cd99a00f71c3f904395

commit ab0818af80f7f683080873f3fa24734b65267df2
Author: sraut <Biplab.Raut@amd.com>
Date:   Wed Oct 3 15:30:33 2018 +0530

    Review comments incorporated for small TRSM.
    
    Change-Id: Ia64b7b2c0375cc501c2cb0be8a1af93111808cd9

commit 32392cfc72af7f42da817a129748349fb1951346
Author: Jeff Hammond <jeff.r.hammond@intel.com>
Date:   Tue May 14 15:52:30 2019 -0400

    add info about CXX in configure (#311)

commit fa7e6b182b8365465ade178b0e4cd344ff6f6460
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 1 19:13:00 2019 -0500

    Define _POSIX_C_SOURCE in bli_system.h.
    
    Details:
    - Added
        #ifndef _POSIX_C_SOURCE
        #define _POSIX_C_SOURCE 200809L
        #endif
      to bli_system.h so that an application that uses BLIS (specifically,
      an application that #includes blis.h) does not need to remember to
      #define the macro itself (either on the command line or in the code
      that includes blis.h) in order to activate things like the pthreads.
      Thanks to Christos Psarras for reporting this issue and suggesting
      this fix.
    - Commented out #include <sys/time.h> in bli_system.h, since I don't
      think this header is used/needed anymore.
    - Comment update to function macro for bli_?normiv_unb_var1() in
      frame/util/bli_util_unb_var1.c.
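    
    A minimal sketch of the guard described above, written as a standalone
    translation unit rather than as the contents of bli_system.h. The
    helper bounded_len() is hypothetical; strnlen() is used only as a
    stand-in for a POSIX.1-2008 facility (the commit itself mentions
    pthreads) that strict ISO C modes may hide unless the feature-test
    macro is defined before the first system header is included.
    
    ```c
    /* Same guard as quoted in the commit; it must come before any
       system header for the feature-test macro to take effect. */
    #ifndef _POSIX_C_SOURCE
    #define _POSIX_C_SOURCE 200809L
    #endif
    
    #include <stddef.h>
    #include <string.h>
    
    /* Hypothetical helper (not part of BLIS): length of s, capped at
       max_len, computed with the POSIX.1-2008 function strnlen(). */
    size_t bounded_len( const char* s, size_t max_len )
    {
        return strnlen( s, max_len );
    }
    ```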

commit 3df84f1b5d5e1146bb01bfc466ac20c60a9cc859
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Apr 27 21:27:32 2019 -0500

    Minor bugfixes in sup dgemm implementation.
    
    Details:
    - Fixed an obscure bug in the bli_dgemmsup_rv_haswell_asm_5x8n() kernel
      that only affected the beta == 0, column-storage output case. Thanks
      to the BLAS test drivers for catching this bug.
    - Previously, bli_gemmsup_ref_var1n() and _var2m() were returning if
      k = 0, when the correct action would be to scale by beta (and then
      return). Thanks to the BLAS test drivers for catching this bug.
    - Changed the sup threshold behavior such that the sup implementation
      only kicks in if a matrix dimension is strictly less than (rather than
      less than or equal to) the threshold in question.
    - Initialize all thresholds to zero (instead of 10) by default in
      ref_kernels/bli_cntx_ref.c. This, combined with the above change to
      threshold testing means that calls to BLIS or BLAS with one or more
      matrix dimensions of zero will no longer trigger the sup
      implementation.
    - Added disabled debugging output to frame/3/bli_l3_sup.c (for future
      use, perhaps).
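    
    The revised threshold test can be sketched as follows. This is an
    illustration under stated assumptions, not the code in
    frame/3/bli_l3_sup.c: the function name and the use of plain long
    arguments are hypothetical. It shows why zero-valued thresholds
    combined with a strict comparison keep zero-sized problems out of
    the sup path.
    
    ```c
    #include <stdbool.h>
    
    /* Hypothetical helper (not the BLIS API): returns true if at least
       one dimension is strictly less than its threshold, i.e. the sup
       implementation would accept the problem. */
    bool sup_thresh_is_met( long m, long n, long k,
                            long mt, long nt, long kt )
    {
        return ( m < mt ) || ( n < nt ) || ( k < kt );
    }
    ```
    
    With all thresholds at zero, a call such as
    sup_thresh_is_met( 0, 100, 100, 0, 0, 0 ) is false, whereas a
    less-than-or-equal comparison would have been true.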

commit ecbdd1c42dcebfecd729fe351e6bb0076aba7d81
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Apr 27 19:38:11 2019 -0500

    Ceased use of BLIS_ENABLE_SUP_MR/NR_EXT macros.
    
    Details:
    - Removed already limited use of the BLIS_ENABLE_SUP_MR_EXT and
      BLIS_ENABLE_SUP_NR_EXT macros in bli_gemmsup_ref_var1n() and
      bli_gemmsup_ref_var2m(). Their purpose was merely to avoid a long
      conditional that would determine whether to allow the last iteration
      to be merged with the second-to-last iteration. Functionally, the
      macros were not needed, and they ended up causing problems when
      building configuration families such as intel64 and x86_64.

commit aa8a6bec3036a41e1bff2034f8ef6766a704ec49
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Apr 27 18:53:33 2019 -0500

    Fixed typo in --disable-sup-handling macro guard.
    
    Details:
    - Fixed an incorrectly-named macro guard that is intended to allow
      disabling of the sup framework via the configure option
      --disable-sup-handling. In this case, the preprocessor macro,
      BLIS_DISABLE_SUP_HANDLING, was still named by its name from an older
      uncommitted version of the code (BLIS_DISABLE_SM_HANDLING).

commit b9c9f03502c78a63cfcc21654b06e9089e2a3822
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Apr 27 18:44:50 2019 -0500

    Implemented gemm on skinny/unpacked matrices.
    
    Details:
    - Implemented a new sub-framework within BLIS to support the management
      of code and kernels that specifically target matrix problems for which
      at least one dimension is deemed to be small, which can result in long
      and skinny matrix operands that are ill-suited for the conventional
      level-3 implementations in BLIS. The new framework tackles the problem
      in two ways. First the stripped-down algorithmic loops forgo the
      packing that is famously performed in the classic code path. That is,
      the computation is performed by a new family of kernels tailored
      specifically for operating on the source matrices as-is (unpacked).
      Second, these new kernels will typically (and in the case of haswell
      and zen, do in fact) include separate assembly sub-kernels for
      handling of edge cases, which helps smooth performance when performing
      problems whose m and n dimensions are not naturally multiples of the
      register blocksizes. In a reference to the sub-framework's purpose of
      supporting skinny/unpacked level-3 operations, the "sup" operation
      suffix (e.g. gemmsup) is typically used to denote a separate namespace
      for related code and kernels. NOTE: Since the sup framework does not
      perform any packing, it targets row- and column-stored matrices A, B,
      and C. For now, if any matrix has non-unit strides in both dimensions,
      the problem is computed by the conventional implementation.
    - Implemented the default sup handler as a front-end to two variants.
      bli_gemmsup_ref_var2() provides a block-panel variant (in which the
      2nd loop around the microkernel iterates over n and the 1st loop
      iterates over m), while bli_gemmsup_ref_var1() provides a panel-block
      variant (2nd loop over m and 1st loop over n). However, these variants
      are not used by default and are provided for reference only. Instead, the
      default sup handler calls _var2m() and _var1n(), which are similar
      to _var2() and _var1(), respectively, except that they defer to the
      sup kernel itself to iterate over the m and n dimension, respectively.
      In other words, these variants rely not on microkernels, but on
      so-called "millikernels" that iterate along m and k, or n and k.
      The benefit of using millikernels is a reduction of function call
      and related (local integer typecast) overhead as well as the ability
      for the kernel to know which micropanel (A or B) will change during
      the next iteration of the 1st loop, which allows it to focus its
      prefetching on that micropanel. (In _var2m()'s millikernel, the upanel
      of A changes while the same upanel of B is reused. In _var1n()'s, the
      upanel of B changes while the upanel of A is reused.)
    - Added a new configure option, --[en|dis]able-sup-handling, which is
      enabled by default. However, the default thresholds at which the
      default sup handler is activated are set to zero for each of the m, n,
      and k dimensions, which effectively disables the implementation. (The
      default sup handler only accepts the problem if at least one dimension
      is smaller than or equal to its corresponding threshold. If all
      dimensions are larger than their thresholds, the problem is rejected
      by the sup front-end and control is passed back to the conventional
      implementation, which proceeds normally.)
    - Added support to the cntx_t structure to track new fields related to
      the sup framework, most notably:
      - sup thresholds: the thresholds at which the sup handler is called.
      - sup handlers: the address of the function to call to implement
        the level-3 skinny/unpacked matrix implementation.
      - sup blocksizes: the register and cache blocksizes used by the sup
        implementation (which may be the same or different from those used
        by the conventional packm-based approach).
      - sup kernels: the kernels that the handler will use in implementing
        the sup functionality.
      - sup kernel prefs: the IO preference of the sup kernels, which may
        differ from the preferences of the conventional gemm microkernels'
        IO preferences.
    - Added a bool_t to the rntm_t structure that indicates whether sup
      handling should be enabled/disabled. This allows per-call control
      of whether the sup implementation is used, which is useful for test
      drivers that wish to switch between the conventional and sup codes
      without having to link to different copies of BLIS. The corresponding
      accessor functions for this new bool_t are defined in bli_rntm.h.
    - Implemented several row-preferential gemmsup kernels in a new
      directory, kernels/haswell/3/sup. These kernels include two general
      implementation types--'rd' and 'rv'--for the 6x8 base shape, with
      two specialized millikernels that embed the 1st loop within the kernel
      itself.
    - Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference
      gemmsup microkernels. NOTE: These microkernels, unlike the current
      crop of conventional (pack-based) microkernels, do not use constant
      loop bounds. Additionally, their inner loop iterates over the k
      dimension.
    - Defined new typedef enums:
      - stor3_t: captures the effective storage combination of the level-3
        problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A
        special value of BLIS_XXX is used to denote an arbitrary combination
        which, in practice, means that at least one of the operands is
        stored according to general stride.
      - threshid_t: captures each of the three dimension thresholds.
    - Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create()
      can be passed "-1, -1" as a lazy request for row storage. (Note that
      "0, 0" is still accepted as a lazy request for column storage.)
    - Added support for various instructions to bli_x86_asm_macros.h,
      including imul, vhaddps/pd, and other instructions related to integer
      vectors.
    - Disabled the older small matrix handling code inserted by AMD in
      bli_gemm_front.c, since the sup framework introduced in this commit
      is intended to provide a more generalized solution.
    - Added test/sup directory, which contains standalone performance test
      drivers, a Makefile, a runme.sh script, and an 'octave' directory
      containing scripts compatible with GNU Octave. (They also may work
      with matlab, but if not, they are probably close to working.)
    - Reinterpret the storage combination string (sc_str) in the various
      level-3 testsuite modules (e.g. src/test_gemm.c) so that the order
      of each matrix storage char is "cab" rather than "abc".
    - Comment updates in level-3 BLAS API wrappers in frame/compat.
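    
    The lazy stride requests mentioned above for bli_obj_create() can be
    illustrated with a small sketch. It is a hypothetical stand-in, not
    the body of bli_adjust_strides(): it ignores any leading-dimension
    alignment or padding the library may apply and only shows how
    "0, 0" and "-1, -1" could map to column and row storage for an
    m x n matrix.
    
    ```c
    /* Hypothetical stand-in (not BLIS code): resolve lazy stride
       requests for an m x n matrix.
         rs == 0  and cs == 0   -> column storage (rs = 1, cs = m)
         rs == -1 and cs == -1  -> row storage    (rs = n, cs = 1)
       Any other pair is left exactly as the caller gave it. */
    void adjust_strides_sketch( long m, long n, long* rs, long* cs )
    {
        if ( *rs == 0 && *cs == 0 )
        {
            *rs = 1;
            *cs = m;
        }
        else if ( *rs == -1 && *cs == -1 )
        {
            *rs = n;
            *cs = 1;
        }
    }
    ```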

commit 0d549ceda822833bec192bbf80633599620c15d9
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Sat Apr 27 22:56:02 2019 +0000

    make unix friendly archives on appveyor (#310)

commit ca4b33c001f9e959c43b95a9a23f9df5adec7adf
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Wed Apr 24 15:02:39 2019 +0530

    Added compiler option (-mno-avx256-split-unaligned-store) in the file config/zen/make_defs.mk to improve performance of intrinsic codes. This flag ensures the compiler generates 256-bit stores for the equivalent intrinsics code.
    
    Change-Id: I8f8cd81a3604869df18d38bc42097a04f178d324

commit 945928c650051c04d6900c7f4e9e29cd0e5b299f
Merge: 663f6629 74e513eb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Apr 17 15:58:56 2019 -0500

    Merge branch 'amd' of github.com:flame/blis into amd

commit 74e513eb6a6787a925d43cd1500277d54d86ab8f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Apr 17 13:34:44 2019 -0500

    Support row storage in Eigen gemm test/3 driver.
    
    Details:
    - Added preprocessor branches to test/3/test_gemm.c to explicitly
      support row-stored matrices. Column-stored matrices are also still
      supported (and are the default for now). (This is mainly residual work
      leftover from initial integration of Eigen into the test drivers, so
      if we ever want to test Eigen with row-stored matrices, the code will
      be ready to use, even if it is not yet integrated into the Makefile
      in test/3.)

commit b5d457fae9bd75c4ca67f7bc7214e527aa248127
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 16 12:50:01 2019 -0500

    Applied forgotten variable rename from 89a70cc.
    
    Details:
    - Somehow the variable name change (root_file_name -> root_inputname)
      in flatten-headers.py mentioned in the commit log entry for 89a70cc
      didn't make it into the actual commit. This commit applies that
      change.

commit 89a70cccf869333147eb2559cdfa5a23dc915824
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 11 18:33:08 2019 -0500

    GNU-like handling of installation prefix et al.
    
    Details:
    - Changed the default installation prefix from $HOME/lib to /usr/local.
    - Modified the way configure internally handles the prefix, libdir,
      includedir, and sharedir (and also added an --exec-prefix option).
      The defaults to these variables are set as follows:
        prefix:      /usr/local
        exec_prefix: ${prefix}
        libdir:      ${exec_prefix}/lib
        includedir:  ${prefix}/include
        sharedir:    ${prefix}/share
      The key change, aside from the addition of exec_prefix and its use to
      define the default to libdir, is that the variables are substituted
      into config.mk with quoting that delays evaluation, meaning the
      substituted values may contain unevaluated references to other
      variables (namely, ${prefix} and ${exec_prefix}). This more closely
      follows GNU conventions, including those used by GNU autoconf, and
      also allows make to override any one of the variables *after*
      configure has already been run (e.g. during 'make install').
    - Updates to build/config.mk.in pursuant to above changes.
    - Updates to output of 'configure --help' pursuant to above changes.
    - Updated docs/BuildSystem.md to reflect the new default installation
      prefix, as well as mention EXECPREFIX and SHAREDIR.
    - Changed the definitions of the UNINSTALL_OLD_* variables in the
      top-level Makefile to use $(wildcard ...) instead of 'find'. This
      was motivated by the new way of handling prefix and friends, which
      leads to the 'find' command being run on /usr/local (by default),
      which can take a while and almost never yields any benefit (since the
      user will very rarely use the uninstall-old targets).
    - Removed periods from the end of descriptive output statements (i.e.,
      non-verbose output) since those statements often end with file or
      directory paths, which get confusing to read when punctuated by a
      period.
    - Trivial change to 'make showconfig' output.
    - Removed my name from 'configure --help'. (Many have contributed to it
      over the years.)
    - In configure script, changed the default state of threading_model
      variable from 'no' to 'off' to match that of debug_type, where there
      are similarly more than two valid states. ('no' is still accepted
      if given via the --enable-debug= option, though it will be
      standardized to 'off' prior to config.mk being written out.)
    - Minor variable name change in flatten-headers.py that was intended for
      32812ff.
    - CREDITS file update.

commit 9d76688ad90014a11ddc0c2f27253d62806216b1
Author: kdevraje <Kiran.Devrajegowda@amd.com>
Date:   Thu Apr 11 10:22:48 2019 +0530

    Fix for single-rank crash with the HPL application. When computing the offset into the C buffer, integer variables are used for the row and column indices, so the intermediate result can overflow and a negative value gets added to the buffer pointer. When that negative value is large enough, the buffer is indexed out of range, resulting in a segmentation fault. Although the crash occurs in the dgemm kernel, similar code was added to the sgemm kernel as well.
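    
    A minimal sketch of the kind of offset computation involved. The
    commit does not show the actual change, so this only illustrates
    one common remedy (widening to 64 bits before multiplying), which
    is an assumption here; the function name c_offset and the
    column-major layout with leading dimension ldc are hypothetical.
    
    ```c
    #include <stdint.h>
    
    /* Hypothetical helper (not the kernel code): element offset of
       C(row, col) in a column-major buffer with leading dimension ldc.
       Each operand is widened to 64 bits before the multiply, so the
       product cannot wrap the way an int * int product could. */
    int64_t c_offset( int row, int col, int ldc )
    {
        return ( int64_t )row + ( int64_t )col * ( int64_t )ldc;
    }
    ```
    
    For example, col = 50000 and ldc = 50000 give an offset of
    2500000000, which does not fit in a 32-bit signed int.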
    
    Change-Id: I171119b0ec0dfbd8e63f1fcd6609a94384aabd27

commit 32812ff5aba05d34c421fe1024a61f3e2d5e7052
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 9 12:20:19 2019 -0500

    Minor bugfix to flatten-headers.py.
    
    Details:
    - Fixed a minor bug in flatten-headers.py whereby the script, upon
      encountering a #include directive for the root header file, would
      erroneously recurse and inline the contents of that root header.
      The script has been modified to avoid recursion into any headers
      that share the same name as the root-level header that was passed
      into the script. (Note: this bug didn't actually manifest in BLIS,
      so it's merely a precaution for usage of flatten-headers.py in other
      contexts.)

commit bec90e0b6aeb3c9b19589c2b700fda2d66f6ccdf
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 2 17:45:13 2019 -0500

    Minor update to docs/HardwareSupport.md document.
    
    Details:
    - Added more details and clarifying language to implications of 1m and
      the recycling of microkernels between microarchitectures.

commit 89cd650e7be01b59aefaa85885a3ea78970351e4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 2 17:23:55 2019 -0500

    Use void_fp for function pointers instead of void*.
    
    Change void*-typed function pointers to void_fp.
    - Updated all instances of void* variables that store function pointers
      to variables of a new type, void_fp. Originally, I wanted to define
      the type of void_fp as "void (*void_fp)( void )"--that is, a pointer
      to a function with no return value and no arguments. However, once
      I did this, I realized that gcc complains with incompatible pointer
      type (-Wincompatible-pointer-types) warnings every time any such
      pointer is assigned to its final, type-accurate function
      pointer type. That is, gcc will silently typecast a void* to
      another defined function pointer type (e.g. dscalv_ker_ft) during
      an assignment from the former to the latter, but the same statement
      will trigger a warning when typecasting from a void_fp type. I suspect
      an explicit typecast is needed in order to avoid the warning, which
      I'm not willing to insert at this time.
    - Added a typedef to bli_type_defs.h defining void_fp as void*, along
      with a commented-out version of the aborted definition described
      above. (Note that POSIX requires that void* and function pointers
      be interchangeable; it is the C standard that does not provide this
      guarantee.)
    - Comment updates to various _oapi.c files.

commit ffce3d632b284eb52474036096815ec38ca8dd5f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 2 14:40:50 2019 -0500

    Renamed armv8a gemm kernel filename.
    
    Details:
    - Renamed
        kernels/armv8a/3/bli_gemm_armv8a_opt_4x4.c
      to
        kernels/armv8a/3/bli_gemm_armv8a_asm_d6x8.c.
      This follows the naming convention used by other kernel sets, most
      notably haswell.

commit 77867478af02144544b4e7b6df5d54d874f3f93b
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Tue Apr 2 13:33:11 2019 -0500

    Use pthreads on MinGW and Cygwin (#307)

commit 7bc75882f02ce3470a357950878492e87e688cec
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Mar 28 17:40:50 2019 -0500

    Updated Eigen results in docs/graphs with 3.3.90.
    
    Details:
    - Updated the level-3 performance graphs in docs/graphs with new Eigen
      results, this time using a development version cloned from their git
      mirror on March 27, 2019 (version 3.3.90). Performance is improved
      over 3.3.7, though still noticeably short of BLIS/MKL in most cases.
    - Very minor updates to docs/Performance.md and matlab scripts in
      test/3/matlab.

commit 20ea7a1217d3833db89a96158c42da2d6e968ed8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Mar 27 18:09:17 2019 -0500

    Minor text updates (Eigen) to docs/Performance.md.
    
    Details:
    - Added/updated a few more details, mostly regarding Eigen.

commit bfb7e1bc6af468e4ff22f7e27151ea400dcd318a
Merge: 044df950 2c85e1dd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Mar 27 17:58:19 2019 -0500

    Merge branch 'dev'

commit 2c85e1dd9d5d84da7228ea4ae6deec56a89b3a8f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Mar 27 16:29:51 2019 -0500

    Added Eigen results to performance graphs.
    
    Details:
    - Updated the Haswell, SkylakeX, and Epyc performance graphs in
      docs/graphs to report on Eigen implementations, where applicable.
      Specifically, Eigen implements all level-3 operations sequentially,
      however, of those operations it only provides multithreaded gemm.
      Thus, mt results for symm/hemm, syrk/herk, trmm, and trsm are
      omitted. Thanks to Sameer Agarwal for his help configuring and
      using Eigen.
    - Updated docs/Performance.md to note the new implementation tested.
    - CREDITS file update.

commit bfac7e385f8061f2e6591de208b0acf852f04580
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Mar 27 16:04:48 2019 -0500

    Added ability to plot with Eigen in test/3/matlab.
    
    Details:
    - Updated matlab scripts in test/3/matlab to optionally plot/display
      Eigen performance curves. Whether Eigen is plotted is determined by
      a new boolean function parameter, with_eigen.
    - Updated runme.m scratchpad to reflect the latest invocations of the
      plot_panel_4x5() function (with Eigen plotting enabled).

commit 67535317b9411c90de7fa4cb5b0fdb8f61fdcd79
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Mar 27 13:32:18 2019 -0500

    Fixed mislabeled eigen output from test/3 drivers.
    
    Details:
    - Fixed the Makefile in test/3 so that it no longer incorrectly labels
      the matlab output variables from Eigen-linked hemm, herk, trmm, and
      trsm driver output as "vendor". (The gemm drivers were already
      correctly outputting matlab variables containing the "eigen" label.)

commit 044df9506f823643c0cdd53e81ad3c27a9f9d4ff
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Wed Mar 27 12:39:31 2019 -0500

    Test with shared on windows (#306)
    
    Export macros can't support both shared and static at the same time.
    When blis is built with both shared and static, headers assume that
    shared is used at link time and dllimports the symbols with __imp_
    prefix.
    
    To use the headers with static libraries a user can give
    -DBLIS_EXPORT= to import the symbol without the __imp_ prefix

commit 5e6b160c8a85e5e23bab0f64958a8acf4918a4ed
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 26 19:10:59 2019 -0500

    Link to Eigen BLAS for non-gemm drivers in test/3.
    
    Details:
    - Adjusted test/3/Makefile so that the test drivers are linked against
      Eigen's BLAS library for hemm, herk, trmm, and trsm. We have to do
      this since Eigen's headers don't define implementations to the
      standard BLAS APIs.
    - Simplified #included headers in hemm, herk, trmm, and trsm source
      driver files, since nothing specific to Eigen is needed at
      compile-time for those operations.

commit e593221383aae19dfdc3f30539de80ed05cfec7f
Merge: 92fb9c87 c208b9dc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 26 15:51:45 2019 -0500

    Merge branch 'master' into dev

commit 92fb9c87bf88b9f9c401eeecd9aa9c3521bc2adb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 26 15:43:23 2019 -0500

    Add more support for Eigen to drivers in test/3.
    
    Details:
    - Use compile-time implementations of Eigen in test_gemm.c via new
      EIGEN cpp macro, defined on command line. (Linking to Eigen's BLAS
      library is not necessary.) However, as of Eigen 3.3.7, Eigen only
      parallelizes the gemm operation and not hemm, herk, trmm, trsm, or
      any other level-3 operation.
    - Fixed a bug in trmm and trsm drivers whereby the wrong function
      (bli_does_trans()) was being called to determine whether the object
      for matrix A should be created for a left- or right-side case. This
      was corrected by changing the function to bli_is_left(), as is done
      in the hemm driver.
    - Added support for running Eigen test drivers from runme.sh.
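    
    The trmm/trsm driver fix above hinges on the side parameter, not
    the transposition, determining the shape of A. A small sketch of
    that rule, using a hypothetical enum and function name rather than
    the BLIS side_t type or bli_is_left():
    
    ```c
    /* Hypothetical stand-ins for the side parameter (not BLIS types). */
    typedef enum { SIDE_LEFT, SIDE_RIGHT } side_sketch_t;
    
    /* For an m x n matrix B, the triangular (or Hermitian) matrix A is
       m x m when applied from the left and n x n when applied from the
       right. Whether A is transposed does not change this. */
    long a_dim_for_side( side_sketch_t side, long m, long n )
    {
        return ( side == SIDE_LEFT ) ? m : n;
    }
    ```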

commit c208b9dc46852c877197d53b6dd913a046b6ebb6
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Mon Mar 25 13:03:44 2019 -0500

    Fix clang version detection (#305)
    
    clang -dumpversion gives 4.2.1 for all clang versions as clang was
    originally compatible with gcc 4.2.1
    
    Apple clang version and clang version are two different things
    and the real clang version cannot be deduced from apple clang version
    programmatically. Rely on Wikipedia to map apple clang to clang version
    
    Also fixes assembly detection with clang
    
    clang 3.8 can't build knl as it doesn't recognize zmm0

commit 53842c7e7d530cb2d5609d6d124ae350fc345c32
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Fri Mar 22 13:57:14 2019 +0530

    Removed printing alpha and beta values
    
    Change-Id: I49102db510311a30f6a936f9d843f35838f50d23

commit 6805db45e343d83d1adaf9157cf0b841653e9ede
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Fri Mar 22 12:55:35 2019 +0530

    Corrected the alpha and beta values (alpha = -1, beta = 1): bli_setc(-1.0, 0, &alpha) should be used rather than bli_setc(0.0, -1.0, &alpha). This is now corrected.
    
    Change-Id: Ic1102dfd6b50ccf212386a1211c6f31e8d987ef9

commit feefcab4427a75b0b55af215486b85abcda314f7
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Mar 21 18:11:20 2019 -0500

    Allow disabling of BLAS prototypes at compile-time.
    
    Details:
    - Modified bli_blas.h so that:
      - By default, if the BLAS layer is enabled at configure-time, BLAS
        prototypes are also enabled within blis.h;
      - But if the user #defines BLIS_DISABLE_BLAS_DEFS prior to including
        blis.h, BLAS prototypes are skipped over entirely so that, for
        example, the application or some other header pulled in by the
        application may prototype the BLAS functions without causing any
        duplication.
    - Updated docs/BuildSystem.md to document the feature above, and
      related text.

commit 20153cd4b594bc34f860c381ec18de3a6cc743c7
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Thu Mar 21 16:23:53 2019 +0530

    Modified test_gemm.c file in test folder
    A macro 'FILE_IN_OUT' is defined to read input parameters from a csv file.
    Format for input file:
    Each line defines a gemm problem with following parameters: m k n cs_a cs_b cs_c
    The operation always implemented is C = C - A*B and column-major format.
    When macro is disabled - it reverts back to original implementation.
    Usage: ./test_gemm_<mkl/blis/openblas>.x input.csv output.csv
    GEMM is called through BLAS interface
    For BLIS - the test application also prints either 'S' indicating small gemm routine or 'N' - conventional BLIS gemm
    for MKL/OpenBLAS - ignore this character
    
    Change-Id: I0924ef2c1f7bdea48d4cdb230b888e2af2c86a36

commit 288843b06d91e1b4fade337959aef773090bd1c9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Mar 20 17:52:23 2019 -0500

    Added Eigen support to test/3 Makefile, runme.sh.
    
    Details:
    - Added targets to test/3/Makefile that link against a BLAS library
      built by Eigen. It appears, however, that Eigen's BLAS library does
      not support multithreading. (It may be that multithreading is only
      available when using the native C++ APIs.)
    - Updated runme.sh with a few Eigen-related tweaks.
    - Minor tweaks to docs/Performance.md.

commit 153e0be21d9ff413e370511b68d553dd02abada9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 19 17:53:18 2019 -0500

    More minor tweaks to docs/Performance.md.
    
    Details:
    - Defined GFLOPS as billions of floating-point operations per second,
      and reworded the sentence after about normalization.

commit 05c4e42642cc0c8dbfa94a6c21e975ac30c0517a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 19 17:07:20 2019 -0500

    CHANGELOG update (0.5.2)

commit 9204cd0cb0cc27790b8b5a2deb0233acd9edeb9b (tag: 0.5.2)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 19 17:07:18 2019 -0500

    Version file update (0.5.2)

commit 64560cd9248ebf4c02c4a1eeef958e1ca434e510
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 19 17:04:20 2019 -0500

    ReleaseNotes.md update in advance of next version.
    
    Details:
    - Updated ReleaseNotes.md in preparation for next version.

commit ab5ad557ea69479d487c9a3cb516f43fa1089863
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 19 16:50:41 2019 -0500

    Very minor tweaks to Performance.md.

commit 03c4a25e1aa8a6c21abbb789baa599ac419c3641
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 19 16:47:15 2019 -0500

    Minor fixes to docs/Performance.md.
    
    Details:
    - Fixed some incorrect labels associated with the pdf/png graphs,
      apparently the result of copy-pasting.

commit fe6dd8b132f39ecb8893d54cd8e75d4bbf6dab83
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 19 16:30:23 2019 -0500

    Fixed broken section links in docs/Performance.md.
    
    Details:
    - Fixed a few broken section links in the Contents section.

commit 913cf97653f5f9a40aa89a5b79e2b0a8882dd509
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 19 16:15:24 2019 -0500

    Added docs/Performance.md and docs/graphs subdir.
    
    Details:
    - Added a new markdown document, docs/Performance.md, which reports
      performance of a representative set of level-3 operations across a
      variety of hardware architectures, comparing BLIS to OpenBLAS and a
      vendor library (MKL on Intel/AMD, ARMPL on ARM). Performance graphs,
      in pdf and png formats, reside in docs/graphs.
    - Updated README.md to link to new Performance.md document.
    - Minor updates to CREDITS, docs/Multithreading.md.
    - Minor updates to matlab scripts in test/3/matlab.

commit 9945ef24fd758396b698b19bb4e23e53b9d95725
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 19 15:28:44 2019 -0500

    Adjusted cache blocksizes for zen subconfig.
    
    Details:
    - Adjusted the zen sub-configuration's cache blocksizes for float,
      scomplex, and dcomplex based on the existing values for double.
      (The previous values were taken directly from the haswell subconfig,
      which targets Intel Haswell/Broadwell/Skylake systems.)

commit d202d008d51251609d08d3c278bb6f4ca9caf8e4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Mar 18 18:18:25 2019 -0500

    Renamed --enable-export-all to --export-shared=[].
    
    Details:
    - Replaced the existing --enable-export-all / --disable-export-all
      configure option with --export-shared=[public|all], with the 'public'
      instance of the latter corresponding to --disable-export-all and the
      'all' instance corresponding to --enable-export-all. Nothing else
      semantically about the option, or its default, has changed.

commit ff78089870f714663026a7136e696603b5259560
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Mar 18 13:22:55 2019 -0500

    Updates to docs/Multithreading.md.
    
    Details:
    - Made extra explicit the fact that: (a) multithreading in BLIS is
      disabled by default; and (b) even with multithreading enabled, the
      user must specify multithreading at runtime in order to observe
      parallelism. Thanks to M. Zhou for suggesting these clarifications
      in #292.
    - Also made explicit that only the environment variable and global
      runtime API methods are available when using the BLAS API. If the
      user wishes to use the local runtime API (specify multithreading on
      a per-call basis), one of the native BLIS APIs must be used.

commit 3a929a3d0ba0353159a6d4cd188f01b7a390ccfc
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Mon Mar 18 10:51:41 2019 +0530

    Fixed a merge error in bli_gemm_small.c: added the conditional checks for L!=0 && K!=0 that were missed during the merge. This fix allows blastest to pass.
    
    Change-Id: Idc9c9a04d2015a68a19553c437ecaf8f1584026c

commit 663f662932c3f182fefc3c77daa1bf8c3394bb8b
Merge: 938c05ef 6bfe3812
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Mar 16 16:17:12 2019 -0500

    Merge branch 'amd' of github.com:flame/blis into amd

commit 938c05ef8654e2fc013d39a57f51d91d40cc40fb
Merge: 4ed39c09 5a5f494e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Mar 16 16:01:43 2019 -0500

    Merge branch 'amd' of github.com:flame/blis into amd

commit 6bfe3812e29b86c95b828822e4e5473b48891167
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 15 13:57:49 2019 -0500

    Use -fvisibility=[...] with clang on Linux/BSD/OSX.
    
    Details:
    - Modified common.mk to use the -fvisibility=[hidden|default] option
      when compiling with clang on non-Windows platforms (Linux, BSD, OS X,
      etc.). Thanks to Isuru Fernando for pointing out this option works
      with clang on these OSes.

commit 809395649c5bbf48778ede4c03c1df705dd49566
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Mar 13 18:21:35 2019 -0500

    Annotated additional symbols for export.
    
    Details:
    - Added export annotations to additional function prototypes in order to
      accommodate the testsuite.
    - Disabled calling bli_amaxv_check() from within the testsuite's
      test_amaxv.c.

commit e095926c643fd9c9c2220ebecd749caae0f71d42
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Mar 13 17:35:18 2019 -0500

    Support shared lib export of only public symbols.
    
    Details:
    - Introduced a new configure option, --enable-export-all, which will
      cause all shared library symbols to be exported by default, or,
      alternatively, --disable-export-all, which will cause all symbols to
      be hidden by default, with only those symbols that are annotated for
      visibility, via BLIS_EXPORT_BLIS (and BLIS_EXPORT_BLAS for BLAS
      symbols), to be exported. The default for this configure option is
      --disable-export-all. Thanks to Isuru Fernando for consulting on
      this commit.
    - Removed BLIS_EXPORT_BLIS annotations from frame/1m/bli_l1m_unb_var1.h,
      which was intended for 5a5f494.
    - Relocated BLIS_EXPORT-related cpp logic from bli_config.h.in to
      frame/include/bli_config_macro_defs.h.
    - Provided appropriate logic within common.mk to implement variable
      symbol visibility for gcc, clang, and icc (to the extent that each of
      these compilers allows).
    - Relocated --help text associated with debug option (-d) to configure
      slightly further down in the list.

commit 5a5f494e428372c7c27ed1f14802e15a83221e87
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 12 18:45:09 2019 -0500

    Removed export macros from all internal prototypes.
    
    Details:
    - After merging PR #303, at Isuru's request, I removed the use of
      BLIS_EXPORT_BLIS from all function prototypes *except* those that we
      potentially wish to be exported in shared/dynamic libraries. In other
      words, I removed the use of BLIS_EXPORT_BLIS from all prototypes of
      functions that can be considered private or for internal use only.
      This is likely the last big modification along the path towards
      implementing the functionality spelled out in issue #248. Thanks
      again to Isuru Fernando for his initial efforts of sprinkling the
      export macros throughout BLIS, which made removing them where
      necessary relatively painless. Also, I'd like to thank Tony Kelman,
      Nathaniel Smith, Ian Henriksen, Marat Dukhan, and Matthew Brett for
      participating in the initial discussion in issue #37 that was later
      summarized and restated in issue #248.
    - CREDITS file update.

commit 3dc18920b6226026406f1d2a8b2c2b405a2649d5
Merge: b938c16b 766769ee
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 12 11:20:25 2019 -0500

    Merge branch 'master' into dev

commit 766769eeb944bd28641a6f72c49a734da20da755
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Mon Mar 11 19:05:32 2019 -0500

    Export functions without def file (#303)
    
    * Revert "restore bli_extern_defs exporting for now"
    
    This reverts commit 09fb07c350b2acee17645e8e9e1b8d829c73dca8.
    
    * Remove symbols not intended to be public
    
    * No need of def file anymore
    
    * Fix whitespace
    
    * No need of configure option
    
    * Remove export macro from definitions
    
    * Remove blas export macro from definitions

commit 4ed39c0971c7917e2675cf5449f563b1f4751ccc
Merge: 540ec1b4 b938c16b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 8 11:56:58 2019 -0600

    Merge branch 'amd' of github.com:flame/blis into amd

commit b938c16b0c9e839335ac2c14944b82890143d02f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Mar 7 16:40:39 2019 -0600

    Renamed test/3m4m to test/3.
    
    Details:
    - Renamed '3m4m' directory to '3', which captures the directory nicely
      since it builds test drivers to test level-3 operations.
    - These test drivers ceased to be used to test the 3m and 4m (or even
      1m) induced methods long ago, hence the name change.

commit ab89a40582ec7acf802e59b0763bed099a02edd8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Mar 7 16:26:12 2019 -0600

    More minor updates and edits to test/3m4m.
    
    Details:
    - Further updates to matlab scripts, mostly for compatibility with
      GNU Octave.
    - More tweaks to runme.sh.
    - Updates to runme.m that allow copy-paste into matlab interactive
      session to generate graphs.

commit f0e70dfbf3fee4c4e382c2c4e87c25454cbc79a1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Mar 7 01:04:05 2019 +0000

    Very minor updates to test/3m4m for ul252.
    
    Details:
    - Very minor updates to the newly revamped test/3m4m drivers when used
      on a Xeon Platinum (SkylakeX).

commit 7fe44748383071f1cbbc77d904f4ae5538e13065
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Wed Mar 6 16:23:31 2019 +0530

    Disabled BLIS_ENABLE_ZEN_BLOCK_SIZES in bli_family_zen.h for ROME tuning
    
    Change-Id: Iec47fcf51f4d4396afef1ce3958e58cf02c59a57

commit 9f1dbe572b1fd5e7dd30d5649bdf59259ad770d5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 5 17:47:55 2019 -0600

    Overhauled test/3m4m Makefile and scripts.
    
    Details:
    - Rewrote much of Makefile to generate executables for single- and dual-
      socket multithreading as well as single-threaded. Each of the three
      can also use a different problem size range/increment, as is often
      appropriate when doubling/halving the number of threads.
    - Rewrote runme.sh script to flexibly execute as many threading
      parameter scenarios as are given in the input parameter string
      (currently set within the script itself). The string also encodes
      the maximum problem size for each threading scenario, which is used
      to identify the executable to run. Also improved the "progress" output
      of the script to reduce redundant info and improve readability in
      terminals that are not especially wide.
    - Minor updates to test_*.c source files.
    - Updated matlab scripts according to changes made to the Makefile,
      test drivers, and runme.sh script, and renamed 'plot_all.m' to
      'runme.m'.

commit f5ed95ecd7d5eb4a63e1333ad5cc6765fc8df9fe
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Tue Mar 5 15:01:57 2019 +0530

    Merged BLIS Release 1.3
    Modified config/zen/make_defs.mk so that CKVECFLAGS := -mavx2 -mfpmath=sse -mfma -march=znver1
    
    Change-Id: Ia0942d285a21447cd0c470de1bc021fe63e80d81

commit 3bdab823fa93342895bf45d812439324a37db77c
Merge: 70f12f20 e2a02ebd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 28 14:07:24 2019 -0600

    Merge branch 'master' into dev

commit e2a02ebd005503c63138d48a2b7d18978ee29205
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 28 13:58:59 2019 -0600

    Updates (from ls5) to test/3m4m/runme.sh.
    
    Details:
    - Lonestar5-specific updates to runme.sh.

commit f0dcc8944fa379d53770f5cae5d670140918f00c
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Wed Feb 27 17:27:23 2019 -0600

    Add symbol export macro for all functions (#302)
    
    * initial export of blis functions
    
    * Regenerate def file for master
    
    * restore bli_extern_defs exporting for now

commit 540ec1b479712d5e1da637a718927249c15d867f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Feb 24 19:09:10 2019 -0600

    Updated level-3 BLAS to call object API directly.
    
    Details:
    - Updated the BLAS compatibility layer for level-3 operations so that
      the corresponding BLIS object API is called directly rather than first
      calling the typed BLIS API. The previous code based on the typed BLIS
      API calls is still available in a deactivated cpp macro branch, which
      may be re-activated by #defining BLIS_BLAS3_CALLS_TAPI. (This does not
      yet correspond to a configure option. If it seems like people might
      want to toggle this behavior more regularly, a configure option can be
      added in the future.)
    - Updated the BLIS typed API to statically "pre-initialize" objects via
      new initializer macros. Initialization is then finished via calls to
      static functions bli_obj_init_finish_1x1() and bli_obj_init_finish(),
      which are similar to the previously-called functions,
      bli_obj_create_1x1_with_attached_buffer() and
      bli_obj_create_with_attached_buffer(), respectively. (The BLAS
      compatibility layer updates mentioned above employ this new technique
      as well.)
    - Transformed certain routines in bli_param_map.c--specifically, the
      ones that convert netlib-style parameters to BLIS equivalents--into
      static functions, now in bli_param_map.h. (The remaining three classes
      of conversion routines were left unchanged.)
    - Added the aforementioned pre-initializer macros to bli_type_defs.h.
    - Relocated bli_obj_init_const() and bli_obj_init_constdata() from
      bli_obj_macro_defs.h to bli_type_defs.h.
    - Added a few macros to bli_param_macro_defs.h for testing domains for
      real/complexness and precisions for single/double-ness.

commit 8e023bc914e9b4ac1f13614feb360b105fbe44d2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 22 16:55:30 2019 -0600

    Updates to 3m4m/matlab scripts.
    
    Details:
    - Minor updates to matlab graph-generating scripts.
    - Added a plot_all.m script that is more of a scratchpad for copying and
      pasting function invocations into matlab to generate plots that are
      presently of interest to us.

commit b06244d98cc468346eb1a8eb931bc05f35ff280c
Merge: e938ff08 4c7e6680
Author: praveeng <praveen.g@amd.com>
Date:   Thu Feb 21 12:56:15 2019 +0530

    Merge branch 'ut-austin-amd' of ssh://git.amd.com:29418/cpulibraries/er/blis into ut-austin-amd

commit e938ff08cea3d108c84524eb129d9e89d701ea90
Author: praveeng <praveen.g@amd.com>
Date:   Thu Feb 21 12:44:38 2019 +0530

    deleted test.txt
    
    Change-Id: I3871f5fe76e548bc29ec2733745b29964e829dd3

commit ed13ad465dcba350ad3d5e16c9cc7542e33f3760
Author: mkv <Mallikarjuna-Reddy.K-V@amd.com>
Date:   Thu Feb 21 01:04:16 2019 -0500

    added test file for initial commit

commit 4c7e6680832b497468cf50c2399e3ac4de0e3450
Author: praveeng <praveen.g@amd.com>
Date:   Thu Feb 21 12:44:38 2019 +0530

    deleted test.txt
    
    Change-Id: I3871f5fe76e548bc29ec2733745b29964e829dd3

commit 95e070581c54ed2edc211874faec56055ea298c8
Author: mkv <Mallikarjuna-Reddy.K-V@amd.com>
Date:   Thu Feb 21 01:04:16 2019 -0500

    added test file for initial commit

commit 70f12f209bc1901b5205902503707134cf2991a0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 20 16:10:10 2019 -0600

    Changed unsafe-loop to unsafe-math optimizations.
    
    Details:
    - Changed -funsafe-loop-optimizations (re-)introduced in 7690855 for
      make_defs.mk files' CRVECFLAGS to -funsafe-math-optimizations (to
      account for a miscommunication in issue #300). Thanks to Dave Love
      for this suggestion and Jeff Hammond for his feedback on the topic.

commit 7690855c5106a56e5b341a350f8db1c78caacd89
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 18 19:16:01 2019 -0600

    Restored -funsafe-loop-optimizations to subconfigs.
    
    Details:
    - Restored use of -funsafe-loop-optimizations in the definitions of
      CRVECFLAGS (when using gcc), but only for sub-configurations (and
      not configuration families such as amd64, intel64, and x86_64).
      This more or less reverts 5190d05 and 6cf1550.

commit 44994d1490897b08cde52a615a2e37ddae8b2061
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 18 18:35:30 2019 -0600

    Disable TBM, XOP, LWP instructions in AMD configs.
    
    Details:
    - Added -mno-tbm -mno-xop -mno-lwp to CKVECFLAGS in bulldozer,
      piledriver, steamroller, and excavator configurations to explicitly
      disable AMD's bulldozer-era TBM, XOP, and LWP instruction sets in an
      attempt to fix the invalid instruction error that has plagued Travis
      CI builds since 6a014a3. Thanks to Devin Matthews for pointing out
      that the offending instruction was part of TBM (issue #300).
    - Restored -O3 to piledriver configuration's COPTFLAGS.

commit 1e5b530744c1906140d47f43c5cad235eaa619cf
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 18 18:04:38 2019 -0600

    Reverted piledriver COPTFLAGS from -O3 to -O2.
    
    Details:
    - Debugging continues; changing COPTFLAGS for piledriver subconfig from
      -O3 to -O2, its original value prior to 6a014a3.

commit 6cf155049168652c512aefdd16d74e7ff39b98df
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 18 17:29:51 2019 -0600

    Removed -funsafe-loop-optimizations from all configs.
    
    Details:
    - Error persists. Removed -funsafe-loop-optimizations from all remaining
      sub-configurations.

commit 5190d05a27c5fa4c7942e20094f76eb9a9785c3e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 18 17:07:35 2019 -0600

    Removed -funsafe-loop-optimizations from piledriver.
    
    Details:
    - Error persists; continuing debugging from bf0fb78c by removing
      -funsafe-loop-optimizations from piledriver configuration.

commit bf0fb78c5e575372060d22f5ceeb5b332e8978ec
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 18 16:51:38 2019 -0600

    Removed -funsafe-loop-optimizations from families.
    
    Details:
    - Removed -funsafe-loop-optimizations from the configuration families
      affected by 6a014a3, specifically: intel64, amd64, and x86_64.
      This is part of an attempt to debug why the sde, as executed by
      Travis CI, is crashing via the following error:
    
        TID 0 SDE-ERROR: Executed instruction not valid for specified chip
        (ICELAKE): 0x9172a5: bextr_xop rax, rcx, 0x103

commit 6a014a3377a2e829dbc294b814ca257a2bfcb763
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 18 14:52:29 2019 -0600

    Standardized optimization flags in make_defs.mk.
    
    Details:
    - Per Dave Love's recommendation in issue #300, this commit defines
        COPTFLAGS := -O3
      and
        CRVECFLAGS := $(CKVECFLAGS) -funsafe-loop-optimizations
      in the make_defs.mk for all Intel- and AMD-based configurations.

commit 565fa3853b381051ac92cff764625909d105644d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 18 11:43:58 2019 -0600

    Redirect trsm pc, ir parallelism to ic, jr loops.
    
    Details:
    - trsm parallelization was temporarily simplified in 075143d to entirely
      ignore any parallelism specified via the pc or ir loops. Now, any
      parallelism specified to the pc loop will be redirected to the ic
      loop, and any parallelism specified to the ir loop will be redirected
      to the jr loop. (Note that because of inter-iteration dependencies,
      trsm cannot parallelize the ir loop. Parallelism via the pc loop is
      at least somewhat feasible in theory, but it would require tracking
      dependencies between blocks--something for which BLIS currently lacks
      the necessary supporting infrastructure.)
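    The redirection described above can be sketched as follows. This is
    only an illustration, not the code from this commit: the struct and
    function names are hypothetical, and it assumes that "redirected"
    means the pc factor is multiplied into ic and the ir factor into jr,
    so that the total thread count is preserved.

    ```c
    #include <assert.h>

    /* Hypothetical container for the ways of parallelism requested for
       each of the five loops around the microkernel. */
    typedef struct
    {
        int jc, pc, ic, jr, ir;
    } sketch_ways_t;

    /* Fold pc ways into ic and ir ways into jr, leaving pc and ir at 1.
       Assumption: the factors are multiplied, so the product
       jc*pc*ic*jr*ir is the same before and after the call. */
    static void sketch_trsm_redirect_ways( sketch_ways_t* w )
    {
        assert( w != NULL );

        w->ic *= w->pc; w->pc = 1;
        w->jr *= w->ir; w->ir = 1;
    }
    ```

    Under this assumption, a request of jc=2, pc=2, ic=3, jr=4, ir=2
    would become jc=2, pc=1, ic=6, jr=8, ir=1.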

commit a023c643f25222593f4c98c2166212561d030621
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 14 20:18:55 2019 -0600

    Regenerated symbols in build/libblis-symbols.def.
    
    Details:
    - Reran ./build/regen-symbols.sh after running
      'configure --enable-cblas auto'

commit 075143dfd92194647da9022c1a58511b20fc11f3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 14 18:52:45 2019 -0600

    Added support for IC loop parallelism to trsm.
    
    Details:
    - Parallelism within the IC loop (3rd loop around the microkernel) is
      now supported within the trsm operation. This is done via a new branch
      on each of the control and thread trees, which guide execution of a
      new trsm-only subproblem from within bli_trsm_blk_var1(). This trsm
      subproblem corresponds to the macrokernel computation on only the
      block of A that contains the diagonal (labeled as A11 in algorithms
      with FLAME-like partitioning), and the corresponding row panel of C.
      During the trsm subproblem, all threads within the JC communicator
      participate and parallelize along the JR loop, including any
      parallelism that was specified for the IC loop. (IR loop parallelism
      is not supported for trsm due to inter-iteration dependencies.) After
      this trsm subproblem is complete, a barrier synchronizes all
      participating threads and then they proceed to apply the prescribed
      BLIS_IC_NT (or equivalent) ways of parallelism (and any BLIS_JR_NT
      parallelism specified within) to the remaining gemm subproblem (the
      rank-k update that is performed using the newly updated row-panel of
      B). Thus, trsm now supports JC, IC, and JR loop parallelism.
    - Modified bli_trsm_l_cntl_create() to create the new "prenode" branch
      of the trsm_l cntl_t tree. The trsm_r tree was left unchanged, for
      now, since it is not currently used. (All trsm problems are cast in
      terms of left-side trsm.)
    - Updated bli_cntl_free_w_thrinfo() to be able to free the newly shaped
      trsm cntl_t trees. Fixed a potentially latent bug whereby a cntl_t
      subnode is only recursed upon if there existed a corresponding
      thrinfo_t node, which may not always exist (for problems too small
      to employ full parallelization due to the minimum granularity imposed
      by micropanels).
    - Updated other functions in frame/base/bli_cntl.c, such as
      bli_cntl_copy() and bli_cntl_mark_family(), to recurse on sub-prenodes
      if they exist.
    - Updated bli_thrinfo_free() to recurse into sub-nodes and prenodes
      when they exist, and added support for growing a prenode branch to
      bli_thrinfo_grow() via a corresponding set of helper functions named
      with the _prenode() suffix.
    - Added a bszid_t field to thrinfo_t nodes. This field comes in handy when
      debugging the allocation/release of thrinfo_t nodes, as it helps trace
      the "identity" of each node as it is created/destroyed.
    - Renamed
        bli_l3_thrinfo_print_paths() -> bli_l3_thrinfo_print_gemm_paths()
      and created a separate bli_l3_thrinfo_print_trsm_paths() function to
      print out the newly reconfigured thrinfo_t trees for the trsm
      operation.
    - Trivial changes to bli_gemm_blk_var?.c and bli_trsm_blk_var?.c
      regarding variable declarations.
    - Removed subpart_t enum values BLIS_SUBPART1T, BLIS_SUBPART1B,
      BLIS_SUBPART1L, BLIS_SUBPART1R. Then added support for two new labels
      (semantically speaking): BLIS_SUBPART1A and BLIS_SUBPART1B, which
      represent the subpartition ahead of and behind, respectively,
      BLIS_SUBPART1. Updated check functions in bli_check.c accordingly.
    - Shuffled layering/APIs for bli_acquire_mpart_[mn]dim() and
      bli_acquire_mpart_t2b/b2t(), _l2r/r2l().
    - Deprecated old functions in frame/3/bli_l3_thrinfo.c.

commit 78bc0bc8b6b528c79b11f81ea19250a1db7450ed
Author: Nicholai Tukanov <nicholai@utexas.edu>
Date:   Thu Feb 14 13:29:02 2019 -0600

    Power9 sub-configuration  (#298)
    
    Formally registered power9 sub-configuration.
    
    Details:
    - Added and registered power9 sub-configuration into the build system.
      Thanks to Nicholai Tukanov and Devangi Parikh for these contributions.
    - Note: The sub-configuration does not yet have a corresponding
      architecture-specific kernel set registered, and so for now the
      sub-config is using the generic kernel set.

commit 6b832731261f9e7ad003a9ea4682e9ca973ef844
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 12 16:01:28 2019 -0600

    Generalized ref kernels' pragma omp simd usage.
    
    Details:
    - Replaced direct usage of _Pragma( "omp simd" ) in reference kernels
      with PRAGMA_SIMD, which is defined as a function of the compiler being
      used in a new bli_pragma_macro_defs.h file. That definition is cleared
      when BLIS detects that the -fopenmp-simd command line option is
      unsupported. Thanks to Devin Matthews and Jeff Hammond for suggestions
      that guided this commit.
    - Updated configure and bli_config.h.in so that the appropriate anchor
      is substituted in (when the corresponding pragma omp simd support is
      present).
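    A minimal sketch of the idea, for illustration only: the macro and
    function names below are hypothetical stand-ins rather than the
    contents of bli_pragma_macro_defs.h, and SKETCH_OMP_SIMD_OK stands in
    for whatever configure substitutes when -fopenmp-simd is accepted.

    ```c
    #include <assert.h>

    /* Expand to the simd pragma only when support was detected;
       otherwise expand to nothing so the loop compiles as plain C. */
    #if defined( SKETCH_OMP_SIMD_OK )
      #define SKETCH_PRAGMA_SIMD _Pragma( "omp simd" )
    #else
      #define SKETCH_PRAGMA_SIMD
    #endif

    /* A toy axpyv-style loop, y := y + alpha * x, annotated via the macro. */
    static void sketch_axpyv( int n, double alpha, const double* x, double* y )
    {
        assert( n >= 0 );

        SKETCH_PRAGMA_SIMD
        for ( int i = 0; i < n; ++i )
            y[i] += alpha * x[i];
    }
    ```

    Because the macro expands to nothing when support is absent, the
    kernel source itself does not need any compiler-specific branches.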

commit b1f5ce8622b682b79f956fed83f04a60daa8e0fc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 5 17:38:50 2019 -0600

    Minor updates to scripts in test/mixeddt/matlab.

commit 38203ecd15b1fa50897d733daeac6850d254e581
Author: Devangi N. Parikh <dnp@cs.utexas.edu>
Date:   Mon Feb 4 15:28:28 2019 -0500

    Added thunderx2 system in the mixeddt test scripts
    
    Details:
    - Added thunderx2 (tx2) as a system in the runme.sh in test/mixeddt

commit dfc91843ea52297bf636147793029a0c1345be04
Author: Devangi N. Parikh <dnp@cs.utexas.edu>
Date:   Mon Feb 4 15:23:40 2019 -0500

    Fixed gcc flags for thunderx2 subconfiguration
    
    Details:
    - Fixed -march flag. Thunderx2 is an armv8.1a architecture not armv8a.

commit c665eb9b888ec7e41bd0a28c4c8ac4094d0a01b5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jan 28 16:22:23 2019 -0600

    Minor updates to docs, Makefiles.
    
    Details:
    - Changed all occurrences of
        micro-kernel -> microkernel
        macro-kernel -> macrokernel
        micro-panel  -> micropanel
      in all markdown documents in 'docs' directory. This change is being
      made since we've reached the point in adoption and acceptance of
      BLIS's insights where words such as "microkernel" are no longer new,
      and therefore now merit being unhyphenated.
    - Updated "Implementation Notes" sections of KernelsHowTo.md, which
      still contained references to nonexistent cpp macros such as
      BLIS_DEFAULT_MR_? and BLIS_PACKDIM_MR_?.
    - Added 'run-fast' and 'check-fast' targets to testsuite/Makefile.
    - Minor updates to Testsuite.md, including suggesting use of
      'make check' and 'make check-fast' when running from the local
      testsuite directory.
    - Added a comment to top-level Makefile explaining the purpose behind
      the TESTSUITE_WRAPPER variable, which at first glance appears to serve
      no purpose.

commit 1aa280d0520ed5eaea3b119b4e92b789ecad78a4
Author: M. Zhou <5723047+cdluminate@users.noreply.github.com>
Date:   Sun Jan 27 21:40:48 2019 +0000

    Amend OS detection for kFreeBSD. (#295)

commit fffc23bb35d117a433886eb52ee684ff5cf6997f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jan 25 13:35:31 2019 -0600

    CREDITS file update.

commit 26c5cf495ce22521af5a36a1012491213d5a4551
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jan 24 18:49:31 2019 -0600

    Fixed bug in skx subconfig related to bdd46f9.
    
    Details:
    - Fixed code in the skx subconfiguration that became a bug after
      committing bdd46f9. Specifically, the bli_cntx_init_skx() function
      was overwriting default blocksizes for the scomplex and dcomplex
      microkernels despite the fact that only single and double real
      microkernels were being registered. This was not a problem prior to
      bdd46f9 since all microkernels used dynamically-queried (at runtime)
      register blocksizes for loop bounds. However, post-bdd46f9, this
      became a bug because the reference ukernels for scomplex and dcomplex
      were written with their register blocksizes hard-coded as constant
      loop bounds, which conflicted with the erroneous scomplex and dcomplex
      values that bli_cntx_init_skx() was setting in the context. The
      lesson here is that going forward, all subconfigurations must not set
      any blocksizes for datatypes corresponding to default/reference
      microkernels. (Note that a blocksize is left unchanged by the
      bli_cntx_set_blkszs() function if it was set to -1.)
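    The "-1 leaves the value unchanged" convention noted above can be
    sketched as follows. This is not the signature or implementation of
    bli_cntx_set_blkszs(); the names are hypothetical and the blocksize
    values used in the example are made up.

    ```c
    #include <assert.h>

    #define SKETCH_NUM_DT 4 /* float, double, scomplex, dcomplex */

    /* Copy per-datatype blocksizes from src into dst, skipping any
       entry equal to -1 so that the existing (default) value for
       that datatype survives. */
    static void sketch_set_blksz( long dst[SKETCH_NUM_DT],
                                  const long src[SKETCH_NUM_DT] )
    {
        assert( dst != NULL && src != NULL );

        for ( int dt = 0; dt < SKETCH_NUM_DT; ++dt )
            if ( src[dt] != -1 )
                dst[dt] = src[dt];
    }
    ```

    For example, applying { 32, 12, -1, -1 } on top of defaults of
    { 16, 8, 8, 4 } would yield { 32, 12, 8, 4 }: the real-domain entries
    are overridden while the complex-domain defaults are kept.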

commit 180f8e42e167b83a757340ad4bd4a5c7a1d6437b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jan 24 18:01:15 2019 -0600

    Fixed undefined behavior trsm ukr bug in bdd46f9.
    
    Details:
    - Fixed a bug that manifested anytime a configuration was used in which
      optimized microkernels were registered and the trsm operation (or
      kernel) was invoked. The bug resulted from the optimized microkernels'
      register blocksizes conflicting with the hard-coded values--expressed
      in the form of constant loop bounds--used in the new reference trsm
      ukernels that were introduced in bdd46f9. The fix was easy: reverting
      back to the implementation that uses variable-bound loops, which
      amounted to changing an #if 0 to #if 1 (since I preserved the older
      implementation in the file alongside the new code based on constant-
      bound loops). It should be noted that this fix must be permanent,
      since the trsm kernel code with constant-bound loops can never work
      with gemm ukernels that use different register blocksizes.

commit bdd46f9ee88057d52610161966a11c224e5a026c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jan 24 17:23:18 2019 -0600

    Rewrote reference kernels to use #pragma omp simd.
    
    Details:
    - Rewrote level-1v, -1f, and -3 reference kernels in terms of simplified
      indexing annotated by the #pragma omp simd directive, which a compiler
      can use to vectorize certain constant-bounded loops. (The new kernels
      actually use _Pragma("omp simd") since the kernels are defined via
      templatizing macros.) Modest speedup was observed in most cases using
      gcc 5.4.0, which may improve with newer versions. Thanks to Devin
      Matthews for suggesting this via issues #286 and #259.
    - Updated default blocksizes defined in ref_kernels/bli_cntx_ref.c to
      be 4x16, 4x8, 4x8, and 4x4 for single, double, scomplex and dcomplex,
      respectively, with a default row preference for the gemm ukernel. Also
      updated axpyf, dotxf, and dotxaxpyf fusing factors to 8, 6, and 4,
      respectively, for all datatypes.
    - Modified configure to verify that -fopenmp-simd is a valid compiler
      option (via a new detect/omp_simd/omp_simd_detect.c file).
    - Added a new header in which prefetch macros are defined according to
      which compiler is detected (via macros such as __GNUC__). These
      prefetch macros are not yet employed anywhere, though.
    - Updated the year in copyrights of template license headers in
      build/templates and removed AMD as a default copyright holder.

commit 63de2b0090829677755eb5cdb27e73bc738da32d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jan 23 12:16:27 2019 -0600

    Prevent redef of ftnlen in blastest f2c_types.h.
    
    Details:
    - Guard typedef of ftnlen in f2c_types.h with a #ifndef HAVE_BLIS_H
      directive to prevent the redefinition of that type. Thanks to Jeff
      Diamond for reporting this compiler warning (and apologies for the
      delay in committing a fix).

commit eec2e183a7b7d67702dbd1f39c153f38148b2446
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jan 21 12:12:18 2019 -0600

    Added escaping to '/' in os_name in configure.
    
    Details:
    - Added os_name to the list of variables in which the '/' character is
      escaped. This is meant to address (or at least make progress toward
      addressing) #293. Thanks to Isuru Fernando for spotting this as the
      potential fix, and also thanks to M. Zhou for the original report.

commit adf5c17f0839fdbc1f4a1780f637928b1e78e389
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jan 18 15:14:45 2019 -0600

    Formally registered thunderx2 subconfiguration.
    
    Details:
    - Added a separate subconfiguration for thunderx2, which now uses
      different optimization flags than cortexa57/cortexa53.

commit 094cfdf7df6c2764c25fcbfce686ba29b933942c
Author: M. Zhou <5723047+cdluminate@users.noreply.github.com>
Date:   Fri Jan 18 18:46:13 2019 +0000

    Port BLIS to GNU Hurd OS. (#294)
    
    Prevent blis.h from misidentifying Hurd as OSX.

commit 5d7d616e8e591c2f3c7c2d73220eb27ea484f9c9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jan 15 20:52:51 2019 -0600

    README.md update re: mixeddt TOMS paper.

commit 58c7fb4788177487f73a3964b7a910fe4dc75941
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jan 8 17:00:27 2019 -0600

    Added more matlab scripts for mixeddt paper.
    
    Details:
    - Added a variant set of matlab scripts geared to producing plots that
      reflect performance data gathered with and without extra memory
      optimizations enabled. These scripts reside (for now) in
      test/mixeddt/matlab/wawoxmem.

commit 34286eb914b48b56cdda4dfce192608b9f86d053
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jan 8 11:41:20 2019 -0600

    Minor update to docs/HardwareSupport.md.

commit 108b04dc5b1b1288db95f24088d1e40407d7bc88
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jan 7 20:16:31 2019 -0600

    Regenerated symbols in build/libblis-symbols.def.
    
    Details:
    - Reran ./build/regen-symbols.sh after running
      'configure --enable-cblas auto' to reflect removal of
      bli_malloc_pool() and bli_free_pool().

commit 706cbd9d5622f4690e6332a89cf41ab5c8771899
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jan 7 18:28:19 2019 -0600

    Minor tweaks/cleanups to bli_malloc.c, _apool.c.
    
    Details:
    - Removed malloc_ft and free_ft function pointer arguments from the
      interface to bli_apool_init() after deciding that there is no need to
      specify the malloc()/free() for blocks within the apool. (The apool
      blocks are actually just array_t structs.) Instead, we simply call
      bli_malloc_intl()/_free_intl() directly. This has the added benefit
      of allowing additional output when memory tracing is enabled via
      --enable-mem-tracing. Also made corresponding changes elsewhere in
      the apool API.
    - Changed the inner pools (elements of the array_t within the apool_t)
      to use BLIS_MALLOC_POOL and BLIS_FREE_POOL instead of BLIS_MALLOC_INTL
      and BLIS_FREE_INTL.
    - Disabled definitions of bli_malloc_pool() and bli_free_pool() since
      there are no longer any consumers of these functions.
    - Very minor comment / printf() updates.

commit 579145039d945adbcad1177b1d53fb2d3f2e6573
Author: Minh Quan Ho <1337056+hominhquan@users.noreply.github.com>
Date:   Mon Jan 7 23:00:15 2019 +0100

    Initialize error messages at compile time (#289)
    
    * Initialize error messages at compile time
    
    - Assign strings directly to the bli_error_string array at compile time
    instead of calling snprintf() at execution time.
    
    * Retired bli_error_init(), _finalize().
    
    Details:
    - Removed functions obviated by changes in 80e8dc6: bli_error_init(),
      bli_error_finalize(), and bli_error_init_msgs(), as well as calls to
      the former two in bli_init.c.
    
    * Regenerated symbols in build/libblis-symbols.def.
    
    Details:
    - Reran ./build/regen-symbols.sh after running
      'configure --enable-cblas auto'.
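
    The compile-time initialization described above might look roughly like
    the following sketch. All names (ex_*, EX_*) and message strings are
    hypothetical; they are not the actual BLIS error codes or messages.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical error codes used only for this sketch. */
enum { EX_SUCCESS = 0, EX_FAILURE = 1, EX_MALLOC_NULL = 2, EX_NUM_ERRORS = 3 };

#define EX_MAX_MSG_LEN 64

/* The table is filled in by the compiler via designated initializers, so no
   init/finalize routine (and no snprintf() at run time) is needed. */
static const char ex_error_string[EX_NUM_ERRORS][EX_MAX_MSG_LEN] =
{
    [EX_SUCCESS]     = "Success.",
    [EX_FAILURE]     = "Generic failure.",
    [EX_MALLOC_NULL] = "malloc() returned NULL.",
};

/* Look up the message for a given (hypothetical) error code. */
static const char* ex_error_msg(int code)
{
    assert(0 <= code && code < EX_NUM_ERRORS);
    return ex_error_string[code];
}
```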

commit aafbca086e36b6727d7be67e21fef5bd9ff7bfd9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jan 7 12:38:21 2019 -0600

    Updated external package language in README.md.
    
    Details:
    - Updated/added comments about Fedora, OpenSUSE, and GNU Guix under the
      newly-renamed "External GNU/Linux packages" section. Thanks to Dave
      Love for providing these revisions.

commit daacfe68404c9cc8078e5e7ba49a8c7d93e8cda3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jan 7 12:12:47 2019 -0600

    Allow running configure with python 3.4.
    
    Details:
    - Relax version blacklisting of python3 to allow 3.4 or later instead
      of 3.5 or later. Thanks to Dave Love for pointing out that 3.4 was
      sufficient for the purpose of BLIS's build system. (It should be
      noted that we're not sure which, if any, python3 versions prior to
      3.4 are insufficient, and that the only thing stopping us from
      determining this is the fact that these earlier versions of python3
      are not readily available for us to test with.)
    - Updated docs/BuildSystem.md to be explicit about current python2 vs
      python3 version requirements.

commit cdbf16aa93234e0d6a80f0d0e385ec81e7b75465
Author: prangana <pradeep.rao@amd.com>
Date:   Fri Jan 4 15:59:21 2019 +0530

    Update version 1.3
    
    Change-Id: I32a7d24af860e87a60396614075236afb65a28a9

commit cf9c1150515b8e9cc4f12e0d4787b3471b12ba4a
Author: kdevraje <Kiran.Devrajegowda@amd.com>
Date:   Thu Jan 3 09:51:46 2019 +0530

    This commit adds a macro to be enabled when BLIS runs in single-instance mode.
    
    Change-Id: I7f3fd654b78e64c4e6e24e9f0e245b1a30c492b0

commit ad8d9adb09a7dd267bbdeb2bd1fbbf9daf64ee76
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jan 3 16:08:24 2019 -0600

    README.md, CREDITS update.
    
    Details:
    - Added "What's New" and "What People Are Saying About BLIS" sections to
      README.md.
    - Added missing github handles to various individuals' entries in the
      CREDITS file.

commit 7052fca5aef430241278b67d24cef6fe33106904
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jan 2 13:48:40 2019 -0600

    Apply f272c289 to bli_fmalloc_noalign().
    
    Details:
    - Perform the same check for NULL return values and error message output
      in bli_fmalloc_noalign() as is performed by bli_fmalloc_align(). (This
      change was intended for f272c289.)

commit 528e3ad16a42311a852a8376101959b4ccd801a5
Merge: 3126c52e f272c289
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jan 2 13:39:19 2019 -0600

    Merge branch 'amd'

commit 3126c52ea795ffb7d30b16b7f7ccc2a288a6158d
Merge: 61441b24 8091998b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jan 2 13:37:37 2019 -0600

    Merge branch 'amd'

commit f272c2899a6764eedbe05cea874ee3bd258dbff3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jan 2 12:34:15 2019 -0600

    Add error message to malloc() check for NULL.
    
    Details:
    - Output an error message if and when the malloc()-equivalent called by
      bli_fmalloc_align() ever returns NULL. Everything was already in place
      for this to happen, including the error return code, the error string
      sprintf(), the error checking function bli_check_valid_malloc_buf()
      definition, and its prototype. Thanks to Minh Quan Ho for pointing out
      the missing error message.
    - Increased the default block_ptrs_len for each inner pool stored in the
      small block allocator from 10 to 25. Under normal execution, each
      thread uses only 21 blocks, so this change will prevent the sba from
      needing to resize the block_ptrs array of any given inner pool as
      threads initially populate the pool with small blocks upon first
      execution of a level-3 operation.
    - Nix stray newline echo in configure.
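
    As a rough illustration of the NULL check in the first item above, the
    sketch below wraps a caller-supplied malloc()-like function and reports
    a NULL return. The ex_* names are hypothetical, and the sketch only
    prints a message; how BLIS itself reacts to the error is not shown here.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Function pointer type for a malloc()-equivalent. */
typedef void* (*ex_malloc_ft)(size_t size);

/* Returns 1 if buf is usable; otherwise prints a message and returns 0. */
static int ex_check_valid_malloc_buf(const void* buf)
{
    if (buf == NULL)
    {
        fprintf(stderr, "ex: malloc() returned NULL.\n");
        return 0;
    }
    return 1;
}

/* Allocate via the supplied function and check the result. */
static void* ex_fmalloc_noalign(ex_malloc_ft f, size_t size)
{
    assert(f != NULL);
    void* p = f(size);
    (void)ex_check_valid_malloc_buf(p);
    return p;
}

/* Allocator that always fails, for exercising the error path. */
static void* ex_failing_malloc(size_t size)
{
    (void)size;
    return NULL;
}
```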

commit eb97f778a1e13ee8d3b3aade05e479c4dfcfa7c0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 25 20:17:09 2018 -0600

    Added missing AMD copyrights to previous commit.
    
    Details:
    - Forgot to add AMD copyrights to several touched files that did not
      already have them in 2f31743.

commit 2f3174330fb29164097d664b7c84e05c7ced7d95
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 25 19:35:01 2018 -0600

    Implemented a pool-based small block allocator.
    
    Details:
    - Implemented a sophisticated data structure and set of APIs that track
      the small blocks of memory (around 80-100 bytes each) used when
      creating nodes for control and thread trees (cntl_t and thrinfo_t) as
      well as thread communicators (thrcomm_t). The purpose of the small
      block allocator, or sba, is to allow the library to transition into a
      runtime state in which it does not perform any calls to malloc() or
      free() during normal execution of level-3 operations, regardless of
      the threading environment (potentially multiple application threads
      as well as multiple BLIS threads). The functionality relies on a new
      data structure, apool_t, which is (roughly speaking) a pool of
      arrays, where each array element is a pool of small blocks. The outer
      pool, which is protected by a mutex, provides separate arrays for each
      application thread while the arrays each handle multiple BLIS threads
      for any given application thread. The design minimizes the potential
      for lock contention, as only concurrent application threads would
      need to fight for the apool_t lock, and only if they happen to begin
      their level-3 operations at precisely the same time. Thanks to Kiran
      Varaganti and AMD for requesting this feature.
    - Added a configure option to disable the sba pools, which are enabled
      by default; renamed the --[dis|en]able-packbuf-pools option to
      --[dis|en]able-pba-pools; and rewrote the --help text associated with
      this new option and consolidated it with the --help text for the
      option associated with the sba (--[dis|en]able-sba-pools).
    - Moved the membrk field from the cntx_t to the rntm_t. We now pass in
      a rntm_t* to the bli_membrk_acquire() and _release() APIs, just as we
      do for bli_sba_acquire() and _release().
    - Replaced all calls to bli_malloc_intl() and bli_free_intl() that are
      used for small blocks with calls to bli_sba_acquire(), which takes a
      rntm (in addition to the bytes requested), and bli_sba_release().
      These latter two functions reduce to the former two when the sba pools
      are disabled at configure-time.
    - Added rntm_t* arguments to various cntl_t and thrinfo_t functions, as
      required by the new usage of bli_sba_acquire() and _release().
    - Moved the freeing of "old" blocks (those allocated prior to a change
      in the block_size) from bli_membrk_acquire_m() to the implementation
      of the pool_t checkout function.
    - Miscellaneous improvements to the pool_t API.
    - Added a block_size field to the pblk_t.
    - Harmonized the way that the trsm_ukr testsuite module performs packing
      relative to that of gemmtrsm_ukr, in part to avoid the need to create
      a packm control tree node, which now requires a rntm_t that has been
      initialized with an sba and membrk.
    - Re-enabled the explicit call to bli_finalize() in the testsuite so that
      users who run the testsuite with memory tracing enabled can check for
      memory leaks.
    - Manually imported the compact/minor changes from 61441b24 that cause
      the rntm to be copied locally when it is passed in via one of the
      expert APIs.
    - Reordered parameters to various bli_thrcomm_*() functions so that the
      thrcomm_t* to the comm being modified is last, not first.
    - Added more descriptive tracing for allocating/freeing small blocks and
      formalized via a new configure option: --[dis|en]able-mem-tracing.
    - Moved some unused scalm code and headers into frame/1m/other.
    - Whitespace changes to bli_pthread.c.
    - Regenerated build/libblis-symbols.def.
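
    The core idea behind the small block allocator (recycle small blocks so
    that steady-state execution performs no malloc()/free()) can be sketched
    with a single, non-thread-safe pool as follows. This is a simplified
    illustration, not the apool_t design itself: there is no outer pool, no
    per-thread array, and no mutex, and all ex_* names and the capacity are
    hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define EX_POOL_CAP 8   /* hypothetical capacity of the free-block cache */

typedef struct
{
    void*  blocks[EX_POOL_CAP]; /* cached free blocks */
    size_t num_free;            /* number of cached blocks */
    size_t block_size;          /* size of every block in this pool */
    size_t num_mallocs;         /* counts malloc() calls, for illustration */
} ex_pool_t;

static void ex_pool_init(ex_pool_t* pool, size_t block_size)
{
    assert(pool != NULL && block_size > 0);
    pool->num_free    = 0;
    pool->block_size  = block_size;
    pool->num_mallocs = 0;
}

/* Hand out a cached block if one exists; otherwise fall back to malloc(). */
static void* ex_pool_acquire(ex_pool_t* pool)
{
    if (pool->num_free > 0) return pool->blocks[--pool->num_free];
    pool->num_mallocs++;
    return malloc(pool->block_size);
}

/* Return a block to the cache; free it only if the cache is full. */
static void ex_pool_release(ex_pool_t* pool, void* block)
{
    if (pool->num_free < EX_POOL_CAP) pool->blocks[pool->num_free++] = block;
    else                              free(block);
}

/* Free all cached blocks (e.g. at library finalization). */
static void ex_pool_finalize(ex_pool_t* pool)
{
    while (pool->num_free > 0) free(pool->blocks[--pool->num_free]);
}
```

    After the first acquire/release cycle has populated the cache, later
    acquisitions of the same size are served without calling malloc().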

commit 61441b24f3244a4b202c29611a4899dd5c51d3a1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 20 19:38:11 2018 -0600

    Make local copy of user's rntm_t in level-3 ops.
    
    Details:
    - In the case that the caller passes in a non-NULL rntm_t pointer into
      one of the expert APIs for a level-3 operation (e.g. bli_gemm_ex()),
      make a local copy of the rntm_t and use the address of that local copy
      in all subsequent execution (which may change the contents of the
      rntm_t). This prevents a potentially confusing situation whereby a
      user-initialized rntm_t is used once (in, say, gemm), and then found
      by the user to be in a different state before it is used a second
      time.
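
    A minimal sketch of the copy-on-entry pattern described above, using a
    hypothetical stand-in struct (ex_rntm_t) rather than the real rntm_t,
    and a made-up internal step that mutates the runtime object:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a runtime object; not the real rntm_t layout. */
typedef struct
{
    int num_threads;
    int ways[5];
} ex_rntm_t;

/* Hypothetical internal step that modifies the runtime object in place. */
static void ex_internal_setup(ex_rntm_t* r)
{
    if (r->num_threads > 1)
    {
        r->ways[0]     = r->num_threads;
        r->num_threads = -1;
    }
}

/* Expert-style entry point: works on a local copy so that the caller's
   object is never modified. Returns ways[0] of the local copy. */
static int ex_gemm_ex(const ex_rntm_t* rntm)
{
    ex_rntm_t rntm_l = { 1, { 1, 1, 1, 1, 1 } }; /* defaults when NULL */

    if (rntm != NULL) rntm_l = *rntm;            /* local copy */

    ex_internal_setup(&rntm_l);                  /* may modify the copy */

    return rntm_l.ways[0];
}
```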

commit e809b5d2f1023b4249969e2f516291c9a3a00b80
Merge: 76016691 0476f706
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 20 16:27:26 2018 -0600

    Merge branch 'master' into amd

commit 1f4eeee5175a8fc9ac312847c796ce6db5fe75b9
Author: sraut <Biplab.Raut@amd.com>
Date:   Wed Dec 19 21:21:10 2018 +0530

    Fixed BLAS test failures of small matrix SYRK for single and double precision.
    
    Details:
    - SYRK for small matrices was implemented by reusing the small GEMM routine.
      As a result, output was written to the full C matrix, and since C is
      symmetric, its lower and upper triangles contained the same results. The
      BLAS SYRK API spec requires that only the lower or the upper triangle of C
      be written. This caused BLAS test failures, even though the BLIS testsuite
      was passing the small SYRK operation.
    - To fix the BLAS test failures, separate small SYRK kernel routines are
      implemented for both single and double precision. The newly added
      routines are in kernels/zen/3/bli_syrk_small.c. The intermediate results
      for matrix C are now written to a scratch buffer, and the final results
      are copied from the scratch buffer to either the lower or the upper
      triangle of matrix C using SIMD copies.
    - The source and header files frame/3/syrk/bli_syrk_front.c and
      frame/3/syrk/bli_syrk_front.h are changed to invoke the new small SYRK
      routines.
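
    The scratch-buffer approach described above can be sketched in plain C
    as follows. This is a scalar illustration under simplifying assumptions
    (column-major storage, C := A * A^T with no alpha/beta scaling, no SIMD);
    the ex_* names are hypothetical and this is not the zen kernel itself.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { EX_LOWER, EX_UPPER } ex_uplo_t;

/* A is n-by-k, C is n-by-n, both column-major. Only the triangle of C
   selected by uplo is written. Returns 0 on success, -1 on failure. */
static int ex_syrk_small(ex_uplo_t uplo, int n, int k,
                         const double* a, int lda, double* c, int ldc)
{
    assert(n > 0 && k >= 0 && lda >= n && ldc >= n);

    double* scratch = malloc((size_t)n * (size_t)n * sizeof(double));
    if (scratch == NULL) return -1;

    /* GEMM-like step: compute the full product into the scratch buffer. */
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
        {
            double sum = 0.0;
            for (int p = 0; p < k; ++p)
                sum += a[i + p * lda] * a[j + p * lda];
            scratch[i + j * n] = sum;
        }

    /* Copy only the requested triangle (including the diagonal) into C. */
    for (int j = 0; j < n; ++j)
    {
        int i_begin = (uplo == EX_LOWER) ? j : 0;
        int i_end   = (uplo == EX_LOWER) ? n : j + 1;
        for (int i = i_begin; i < i_end; ++i)
            c[i + j * ldc] = scratch[i + j * n];
    }

    free(scratch);
    return 0;
}
```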
    
    Change-Id: I9cfb1116c93d150aefac673fca033952ecac97cb

commit 6d267375c3a0543f20604d74cc678ad91db3b6f1
Author: sraut <Biplab.Raut@amd.com>
Date:   Wed Dec 19 14:22:21 2018 +0530

    This commit improves the performance of multi-instance DGEMM when the threads are bound to a CCX.
    Multi-Instance: Each thread runs a sequential DGEMM.
    Change-Id: I306920c8061b6dad61efac1dae68727f4ac27df6

commit 0476f706b93e83f6b74a3d7b7e6e9cc9a1a52c3b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 18 14:56:20 2018 -0600

    CHANGELOG update (0.5.1)

commit e0408c3ca3d53bc8e6fedac46ea42c86e06c922d (tag: 0.5.1)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 18 14:56:16 2018 -0600

    Version file update (0.5.1)

commit 3ab231afc9f69d14493908c53c85a84c5fba58aa
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 18 14:53:37 2018 -0600

    ReleaseNotes.md update in advance of next version.
    
    Details:
    - Updated ReleaseNotes.md in preparation for next version.

commit d1aa87164e1e82347d62aa98793963c5265ef7e7
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 18 14:52:40 2018 -0600

    README.md update (External packages section).
    
    Details:
    - Updated External packages section in anticipation of introducing BLIS
      into Debian package universe. Thanks to M. Zhou for sponsoring BLIS in
      Debian.

commit 7bf901e9265a1acd78e44c06f7178c8152c7e267
Author: sraut <Biplab.Raut@amd.com>
Date:   Tue Dec 18 14:39:16 2018 +0530

    Fix for a multi-instance performance issue on EPYC machines.
    Issue: With the default values of mc, kc, and nc in multi-instance mode, performance across the cores dips drastically.
    Fix: After experimentation, found a different set of values (mc, kc, and nc) that fits in the cache, so that performance remains the same across all cores.
    
    Change-Id: I98265e3b7e61cd7602a0cc5596240e86c08c03fe

commit d2b2a0819a2fccad9165bc48c0e172d79a87542c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 17 19:26:35 2018 -0600

    Removed stray sections from Multithreading.md.
    
    Details:
    - Removed unintended section headers from before table of contents.

commit 93d56319f2953cf0e9df1ff2cda90b8e41351b2c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 17 19:17:30 2018 -0600

    Added missing bli_init_once() in bli_thread API.
    
    Details:
    - Fixed an issue with specifying threading globally at runtime via
      bli_thread_set_num_threads() (the automatic way) or via
      bli_thread_set_ways() (the manual way), with bli_thread_init_rntm()
      also affected. These functions were not calling bli_init_once() prior
      to acting, and therefore their effects on the global rntm_t structure
      were being wiped out by the eventual call to bli_init_once(), by some
      other BLIS function. Thanks to Ali Emre Gülcü for reporting the
      behavior associated with this bug.
    - Added additional content to docs/Multithreading.md covering topics of
      choosing between OpenMP and pthreads, and specifying affinity via
      OpenMP.
    - CREDITS file update.
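
    The ordering bug described in the first item above can be illustrated
    with the following sketch. It is deliberately single-threaded and uses a
    plain flag; a thread-safe library would rely on a once-style primitive
    instead. All ex_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

static int  ex_global_num_threads = 1;
static bool ex_is_initialized     = false;

/* One-time initialization: resets global state to its defaults. */
static void ex_init_once(void)
{
    if (ex_is_initialized) return;
    ex_global_num_threads = 1;
    ex_is_initialized     = true;
}

/* Buggy setter: writes global state without initializing first, so a later
   ex_init_once() call overwrites the value. */
static void ex_set_num_threads_buggy(int n)
{
    ex_global_num_threads = n;
}

/* Fixed setter: initialize first, then apply the user's setting. */
static void ex_set_num_threads(int n)
{
    ex_init_once();
    ex_global_num_threads = n;
}

/* Any other library function: initializes on entry, then reads the state. */
static int ex_get_num_threads(void)
{
    ex_init_once();
    return ex_global_num_threads;
}

/* Demo-only helper that returns the sketch to its uninitialized state. */
static void ex_reset_for_demo(void)
{
    ex_is_initialized     = false;
    ex_global_num_threads = 1;
}
```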

commit 76016691e2c514fcb59f940c092475eda968daa2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 13 17:23:09 2018 -0600

    Improvements to bli_pool; malloc()/free() tracing.
    
    Details:
    - Added malloc_ft and free_ft fields to pool_t, which are provided when
      the pool is initialized, to allow bli_pool_alloc_block() and
      bli_pool_free_block() to call bli_fmalloc_align()/bli_ffree_align()
      with arbitrary align_size values (according to how the pool_t was
      initialized).
    - Added a block_ptrs_len argument to bli_pool_init(), which allows the
      caller to specify an initial length for the block_ptrs array, which
      previously suffered the cost of being reallocated, copied, and freed
      each time a new block was added to the pool.
    - Consolidated the "buf_sys" and "buf_align" pointer fields in pblk_t
      into a single "buf" field. Consolidated the bli_pblk API accordingly
      and also updated the bli_mem API implementation. This was done
      because I'd previously already implemented opaque alignment via
      bli_malloc_align(), which allocates extra space and stores the
      original pointer returned by malloc() one element before the element
      whose address is aligned.
    - Tweaked bli_membrk_acquire_m() and bli_membrk_release() to call
      bli_fmalloc_align() and bli_ffree_align(), which required adding an
      align_size field to the membrk_t struct.
    - Pass the pack schemas directly into bli_l3_cntl_create_if() rather
      than transmit them via objects for A and B.
    - Simplified bli_l3_cntl_free_if() and renamed to bli_l3_cntl_free().
      The function had not been conditionally freeing control trees for
      quite some time. Also, removed obj_t* parameters since they aren't
      needed anymore (or never were).
    - Spun-off OpenMP nesting code in bli_l3_thread_decorator() to a
      separate function, bli_l3_thread_decorator_thread_check().
    - Renamed:
        bli_malloc_align()   -> bli_fmalloc_align()
        bli_free_align()     -> bli_ffree_align()
        bli_malloc_noalign() -> bli_fmalloc_noalign()
        bli_free_noalign()   -> bli_ffree_noalign()
      The 'f' is for "function" since they each take a malloc_ft or free_ft
      function pointer argument.
    - Inserted various printf() calls for the purposes of tracing memory
      allocation and freeing, guarded by cpp macro ENABLE_MEM_DEBUG, which,
      for now, is intended to be a "hidden" feature rather than one hooked
      up to a configure-time option.
    - Defined bli_rntm_equals(), which compares two rntm_t for equality.
      (There are no use cases for this function yet, but there may be soon.)
    - Whitespace changes to function parameter lists in bli_pool.c, .h.
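
    The "opaque alignment" technique mentioned above (over-allocate, align,
    and stash the original pointer one element before the aligned address)
    can be sketched as follows. The ex_* names are hypothetical, and the
    sketch assumes align_size is a power of two no smaller than
    sizeof(void*); the actual BLIS functions may differ in detail.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef void* (*ex_malloc_ft)(size_t size);
typedef void  (*ex_free_ft)(void* p);

/* Allocate 'size' bytes aligned to 'align' using the supplied malloc(). */
static void* ex_fmalloc_align(ex_malloc_ft f, size_t size, size_t align)
{
    assert(f != NULL);
    assert(align >= sizeof(void*) && (align & (align - 1)) == 0);

    /* Room for the data, worst-case padding, and the saved pointer. */
    unsigned char* p_sys = f(size + align + sizeof(void*));
    if (p_sys == NULL) return NULL;

    /* Skip past the slot for the saved pointer, then round up. */
    uintptr_t addr = (uintptr_t)p_sys + sizeof(void*);
    addr = (addr + align - 1) & ~(uintptr_t)(align - 1);

    /* Store the original pointer one element before the aligned address. */
    void** p_aligned = (void**)addr;
    p_aligned[-1] = p_sys;

    return p_aligned;
}

/* Free a buffer returned by ex_fmalloc_align() using the supplied free(). */
static void ex_ffree_align(ex_free_ft f, void* p)
{
    assert(f != NULL);
    if (p == NULL) return;
    f(((void**)p)[-1]);
}
```

    Because the original pointer travels with the buffer, the caller needs
    to track only one pointer, which is what allowed "buf_sys" and
    "buf_align" to be consolidated into a single "buf" field.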

commit f808d829c58dc4194cc3ebc3825fbdde12cd3f93
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Dec 12 15:22:59 2018 -0600

    Handle edge cases, zero-filling in packm kernels.
    
    Details:
    - Updated the API and semantics of packm kernels such that they must now
      handle edge cases, meaning that a c-by-k packm kernel must be able to
      pack edge cases that are fewer than c rows/columns and be able to
      zero-fill the remaining elements. They must also be able to zero-fill
      the equivalent region when copying fewer than k columns/rows (which is
      needed by trsm). The new packm kernel API is generally:
    
        void packm_kernel
             (
               conj_t           conja,
               dim_t            cdim,
               dim_t            n,
               dim_t            n_max,
               ctype*  restrict kappa,
               ctype*  restrict a, inc_t inca, inc_t lda,
               ctype*  restrict p,             inc_t ldp,
               cntx_t* restrict cntx
             );
    
      where cdim and n are the dimensions (short and long, respectively) of
      the submatrix being copied from the source matrix A, and n_max is the
      "full" long dimension (corresponding to the k dimension in gemm) of
      the micropanel. The "full" short dimension (corresponding to the
      register blocksize MR or NR) is not part of the API because it is
      known intrinsically by the packm kernel implementation. Thanks to
      Devin Matthews for prompting us to make this change (#282).
    - Updated all reference packm kernels in ref_kernels/1m according to
      above changes, as well as all optimized packm kernels (which only
      consisted of those for knl).
    - Bumped the major soname version number in 'so_version' to 2. At first
      I was considering leaving it unchanged, but I couldn't escape the
      reality that the packm kernel API is much closer to an expert API
      than it is some obscure helper function interface within the framework
      that nobody would ever notice.
    - Removed reference packm kernels for mr/nr = 30. The only sub-config
      that would have been using those kernels is knc, which is likely no
      longer being used by very many people (if any). (This also mostly
      offset the larger object code footprint incurred by moving the edge-
      case handling into the individual packm kernels.)
    - Fixed an obscure race condition for 3mh and 4mh induced methods in
      which those implementations were modifying the contexts stored in the
      gks rather than a local copy.
    - Fixed a minor bug in the testsuite that prevented non-1m-based induced
      method implementations of trsm from executing.
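
    The edge-case semantics described in the first item above can be
    sketched for a real double-precision kernel as follows. The signature is
    simplified relative to the API shown in that item (no conj_t, no
    cntx_t, int instead of dim_t/inc_t), and the register blocksize EX_MR
    and the function name are hypothetical.

```c
#include <assert.h>

#define EX_MR 4   /* hypothetical "full" short dimension of the micropanel */

/* Pack a cdim-by-n submatrix of A into an EX_MR-by-n_max micropanel p,
   scaling by kappa and zero-filling whatever is not copied. */
static void ex_dpackm_4xk(int cdim, int n, int n_max, const double* kappa,
                          const double* a, int inca, int lda,
                          double* p, int ldp)
{
    assert(0 <= cdim && cdim <= EX_MR);
    assert(0 <= n && n <= n_max);
    assert(ldp >= EX_MR);

    for (int j = 0; j < n; ++j)
    {
        /* Copy (and scale) the cdim elements that exist in A. */
        for (int i = 0; i < cdim; ++i)
            p[i + j * ldp] = (*kappa) * a[i * inca + j * lda];

        /* Short-dimension edge case: zero-fill up to EX_MR. */
        for (int i = cdim; i < EX_MR; ++i)
            p[i + j * ldp] = 0.0;
    }

    /* Long-dimension edge case: zero-fill columns n through n_max-1. */
    for (int j = n; j < n_max; ++j)
        for (int i = 0; i < EX_MR; ++i)
            p[i + j * ldp] = 0.0;
}
```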

commit 02ec0be3ba0b0d6b4186386ae140906a96de919b
Merge: e275def3 c534da62
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Dec 5 19:33:53 2018 -0600

    Merge branch 'master' into amd

commit c534da62c0015f91391983da5376c9e091378010
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Dec 5 15:51:05 2018 -0600

    Disabled ARM configuration families in registry.
    
    Details:
    - Disabled (commented out) the arm32 and arm64 configuration families
      in the config_registry file. Having a configuration family registered
      only makes sense if BLIS is currently outfitted with runtime hardware
      detection logic to choose the appropriate sub-configuration. That
      logic is currently missing for ARM architectures, and thus having the
      ARM configuration families in the configuration registry only serves
      to confuse people. Thanks to Devangi Parikh for suggesting this
      change.

commit 6885051a164628904fad0d8a3b39c82f9a7b193c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Dec 5 14:45:39 2018 -0600

    Generalizations/cleanup to mixeddt matlab scripts.
    
    Details:
    - Parameterized, reorganized, and added comments to matlab scripts in
      test/mixeddt/matlab.
    - Reordered some lines of code and added comments to plot_l3_perf.m in
      test/3m4m/matlab.

commit cbdb0566bf3201a495bbdcb8cb50342fa0098649
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Dec 5 20:06:32 2018 +0000

    Updates to 3m4m, mixeddt test driver files.
    
    Details:
    - Updated 3m4m and mixeddt Makefiles and runme.sh scripts, mostly to
      port recent changes to the former to the latter.
    - Disabled (for now) code in 3m4m/test_*.c files that disables all
      induced methods except for the one that is requested from the
      Makefile via the IND macro. This is done because usually, we want to
      test whatever method is enabled automatically for complex datatypes.
      (That is, when native complex microkernels are missing, we usually
      want to test performance of 1m.)

commit 0645f239fbdf37ee9d2096ee3bb0e76b3302cfff
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 4 14:31:06 2018 -0600

    Remove UT-Austin from copyright headers' clause 3.
    
    Details:
    - Removed explicit reference to The University of Texas at Austin in the
      third clause of the license comment blocks of all relevant files and
      replaced it with a more all-encompassing "copyright holder(s)".
    - Removed duplicate words ("derived") from a few kernels' license
      comment blocks.
    - Homogenized license comment block in kernels/zen/3/bli_gemm_small.c
      with format of all other comment blocks.

commit 9b688a2d69dd420f4d2582827c5ac87e422cd3bc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 4 13:30:25 2018 -0600

    Refer to color mm algorithm in Multithreading.md.

commit 22384fd2b749aa8cfdfad1084ce5e7dbd4ad2d64
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 4 13:09:04 2018 -0600

    Minor updates to test_gemm.c in test/mixeddt.

commit 2ba3b1780cbca58e43a3948d67bd07e637036125
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 3 19:40:39 2018 -0600

    Removed symbols from libblis-symbols.def.
    
    Details:
    - Removed bli_gemm_md_front() and bli_gemm_md_zgemm() symbols from
      build/libblis-symbols.def, which will hopefully appease AppVeyor.

commit dcb38c4e59c3395c258799e69bfe2104c578c528
Merge: dc184095 375eb30b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 3 18:06:19 2018 -0600

    Merge branch 'dev'

commit 375eb30b0a63ac06a363a5f75f283584258db48b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 3 17:49:52 2018 -0600

    Added mixed-precision support to 1m method.
    
    Details:
    - Lifted the constraint that 1m only be used when all operands' storage
      datatypes (along with the computation datatype) are equal. Now, 1m may
      be used as long as all operands are stored in the complex domain. This
      change largely consisted of adding the ability to pack to 1e and 1r
      formats from one precision to another. It also required adding logic
      for handling complex values of alpha to bli_packm_blk_var1_md()
      (similar to the logic in bli_packm_blk_var1()).
    - Fixed a bug in several virtual microkernels (bli_gemm_md_c2r_ref.c,
      bli_gemm1m_ref.c, and bli_gemmtrsm1m_ref.c) that resulted in the wrong
      ukernel output preference field being read. Previously, the preference
      for the native complex ukernel was being read instead of the pref for
      the native real domain ukernel. This bug would not manifest if the
      preference for the native complex ukernel happened to be equal to that
      of the native real ukernel.
    - Added support for testing mixed-precision 1m execution via the gemm
      module of the testsuite.
    - Tweaked/simplified bli_gemm_front() and bli_gemm_md.c so that pack
      schemas are always read from the context, rather than trying to
      sometimes embed them directly to the A and B objects. (They are still
      embedded, but now uniformly only after reading the schemas from the
      context.)
    - Redefined cpp macro bli_l3_ind_recast_1m_params() as a static function
      and renamed to bli_gemm_ind_recast_1m_params() (since gemm is the only
      consumer).
    - Added 1m optimization logic (via bli_gemm_ind_recast_1m_params()) to
      bli_gemm_ker_var2_md().
    - Added explicit handling for beta == 1 and beta == 0 in the reference
      gemm1m virtual microkernel in ref_kernels/ind/bli_gemm1m_ref.c.
    - Rewrote various level-0 macro defs, including axpyris, axpbyris,
      scal2ris, and xpbyris (and their conjugating counterparts) to
      explicitly support three operand types and updated invocations to
      xpbyris in bli_gemmtrsm1m_ref.c.
    - Query and use the storage datatype of the packed object instead of the
      storage datatype of the source object in bli_packm_blk_var1().
    - Relocated and renamed frame/ind/misc/bli_l3_ind_opt.h to
      frame/3/gemm/ind/bli_gemm_ind_opt.h.
    - Various whitespace/comment updates.
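
    The explicit beta == 1 and beta == 0 handling mentioned above can be
    illustrated with a simple real-domain update of C. This is only a sketch
    of the general idea, not the gemm1m virtual microkernel; the function
    name and the column-major layout of the temporary product are
    assumptions. The commit does not state its motivation; one common
    reason for special-casing beta == 0 is to overwrite C without reading
    it.

```c
#include <assert.h>

/* Update the m-by-n matrix C (column-major, leading dimension ldc) with a
   temporary product ab (column-major, leading dimension m):
     beta == 0:  C := ab            (C is never read)
     beta == 1:  C := C + ab        (no multiplication by beta)
     otherwise:  C := beta * C + ab */
static void ex_update_c(int m, int n, double beta,
                        const double* ab, double* c, int ldc)
{
    assert(m >= 0 && n >= 0 && ldc >= m);

    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
        {
            double  t   = ab[i + j * m];
            double* cij = &c[i + j * ldc];

            if      (beta == 0.0) *cij = t;
            else if (beta == 1.0) *cij += t;
            else                  *cij = beta * (*cij) + t;
        }
}
```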

commit e275def30ac41cadce296560fa67282704f20a02
Merge: 8091998b dc184095
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Nov 30 15:39:50 2018 -0600

    Merge branch 'master' into amd

commit dc18409551f341125169fe8d4d43ac45e81bdf28
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 28 11:58:40 2018 -0600

    CREDITS file update.

commit ee4d2712963816f84d7e3fdd39d93424e1aaf63d
Merge: e81c4b56 3d7e8bc3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 28 11:52:57 2018 -0600

    Merge pull request #287 from SuperFluffy/fix_configuration_links
    
    Fix configuration links

commit 3d7e8bc3b8e77693152138e75676f71573e5e6cd
Author: Richard Janis Goldschmidt <janis.beckert@gmail.com>
Date:   Wed Nov 28 15:56:37 2018 +0100

    Fix configuration links

commit 6a4885f8be9ecd81423ebf2eb6da75d7981c979b
Merge: 1d8aae22 e81c4b56
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Nov 27 13:22:59 2018 -0600

    Merge branch 'master' into dev

commit e81c4b56660b25a39f8fdc09fbe07459c5bd8e8e
Merge: 757043ea cfbdb58d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 21 17:00:49 2018 -0600

    Merge pull request #285 from isuruf/pthread
    
    Move LDFLAGS to the end

commit cfbdb58de2e44f2e3a3d8b14fceece7aef4b3006
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Wed Nov 21 14:23:39 2018 -0600

    Move LDFLAGS to the end
    
    Otherwise the linker will drop flags like -lpthread

commit 757043eae8630c0a76e9bb04f2cb0bd72439a86a
Merge: e769bf46 7af8fa01
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 21 13:07:26 2018 -0600

    Merge pull request #283 from isuruf/patch-3
    
    Fix MinGW and Cygwin build failures

commit 7af8fa01373b7bb30fa3b1fd110fd201c87ea225
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Wed Nov 21 02:10:05 2018 -0600

    Fix blis dll path

commit 2acd8dcd23805203a6821358c5e3e09d521fecdf
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Wed Nov 21 02:02:18 2018 -0600

    Fix install path of dll.a

commit b7b0ad22b151e89e2a6c7782cf4d8d47b4e60734
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Wed Nov 21 01:54:44 2018 -0600

    Test mingw

commit bafe521ed0012b7b8814404b78a6c576d8386370
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Wed Nov 21 01:54:36 2018 -0600

    Fixes for mingw

commit be831879bd03edcddff8a345161f749ad92215af
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Wed Nov 21 01:39:32 2018 -0600

    test gcc shared

commit f6b924648c79c4b1c3d3c7fbf85372680aff8362
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Wed Nov 21 01:39:19 2018 -0600

    Don't use .def for gcc

commit ce6e4eae6d5e977e6f699acc9cf239be8ac53771
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Wed Nov 21 01:34:56 2018 -0600

    test no threading

commit c9169b4685bfe81bc562cf9128b35a6a9884799b
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Wed Nov 21 01:17:36 2018 -0600

    Add mingw64 path

commit 0f753090eaf4264b743a49ce15de97514bcbe112
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Wed Nov 21 01:14:52 2018 -0600

    Fix PATH

commit d424470b1f2fa8717fa54c0245b21341504665f6
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Wed Nov 21 01:04:26 2018 -0600

    Check openmp and pthreads threading

commit c73e7601e58239e2dedec6c9f1b752e949254a42
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Wed Nov 21 00:50:33 2018 -0600

    Revert "enable rdp"
    
    This reverts commit 368274bcbd0c9232521d14fa28304f35ced0e6d7.

commit 6209b2e6060b89e65f3405c31333af8952dd63c0
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Wed Nov 21 00:50:22 2018 -0600

    Remove conda

commit 0b1b344447b8a2fcd635a48f0ce7ce89b2107dc4
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Wed Nov 21 00:42:39 2018 -0600

    Fix make name

commit 7a9838983ba8dd32ac9f87712255721542ff561f
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Wed Nov 21 00:35:27 2018 -0600

    Use m2w64-make

commit 4c1dedd6a90087807f16353a5d0bcaaade35a7a5
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Wed Nov 21 00:28:20 2018 -0600

    No activate on gcc

commit 368274bcbd0c9232521d14fa28304f35ced0e6d7
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Tue Nov 20 23:40:26 2018 -0600

    enable rdp

commit 707a5e7f9b07f554e1e9289dd0ce3b7dc4fded6e
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Tue Nov 20 23:39:31 2018 -0600

    No conda for mingw build

commit 65b0565c0ad9162d4474bd84eabde491fa971538
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Tue Nov 20 23:19:38 2018 -0600

    Check MinGW-w64

commit 9ddffba5847080e0d77d9e6059d05dc4b1d89ba5
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Wed Nov 21 00:23:34 2018 -0600

    Fix MinGW build failure
    
    Fixes https://github.com/flame/blis/issues/278

commit 1d8aae220bc52ce8e3a8afaa64b57e5d83480bdc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Nov 20 18:42:07 2018 -0600

    Track internal scalar datatypes.
    
    Details:
    - Added a num_t datatype bitfield to the obj_t in the form of a new
      info2 field in the obj_t. This change was made primarily so that in
      the case of mixed-datatype gemm, the alpha scalar would not need to
      be cast to the storage datatype of B (or A) before then being cast to
      the computation datatype just before the macrokernel is called. This
      double-casting regime could result in loss of precision if the storage
      precision of B (or A) is lower than the computation precision. In
      practice, it was likely not going to be a big deal since most usage of
      alpha is for -1.0, 0.0, and 1.0 (or integer multiples thereof), which
      can all be represented exactly in single or double precision.
    - The type of objbits_t was changed to uint32_t, so the new format
      potentially takes up the same space as the previous obj_t definition,
      assuming no padding inserted by the compiler. Shrinking info to 32
      bits and spilling over into a second field was chosen over using the
      high 32 bits of a single 64-bit objbits_t info field because many of
      the bitwise operations are performed with enums such as num_t, dom_t,
      and prec_t, which may take on the type of 32-bit ints. It's easier to
      just keep all of those bitwise operations in 32 bits than perform a
      million typecasts throughout bli_type_defs.h and bli_obj_macro_defs.h
      to ensure that the integers are treated as 64-bit for the purposes of
      the ANDs, ORs, and bitshifts.
    - Many comment updates.
    - Thanks to Devin Matthews and Devangi Parikh for their feedback and
      involvement during this commit cycle.
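    
    Sketch (illustrative only; the function names below are hypothetical and
    are not part of BLIS): the double-casting concern in the first item can be
    reproduced with a double-precision alpha that is cast through a
    single-precision storage type before reaching a double-precision
    computation.
    
    ```c
    #include <assert.h>
    
    /* Hypothetical: mimic casting alpha to the storage datatype of B
       (single precision here) and then to the computation datatype
       (double precision here). */
    static double sk_alpha_via_storage( double alpha )
    {
        float as_storage = ( float )alpha;  /* first cast: may round */
        return ( double )as_storage;        /* second cast: cannot restore bits */
    }
    
    /* Common scalars such as -1.0, 0.0, and 1.0 are exactly representable
       in single precision, so they survive the round trip unchanged. */
    static void sk_check_common_alphas( void )
    {
        assert( sk_alpha_via_storage( -1.0 ) == -1.0 );
        assert( sk_alpha_via_storage(  0.0 ) ==  0.0 );
        assert( sk_alpha_via_storage(  1.0 ) ==  1.0 );
    }
    ```
    
    A value such as 0.1, which is not exactly representable in binary
    floating point, would come back from sk_alpha_via_storage() changed. That
    is the kind of loss that tracking the scalar datatype separately avoids.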

commit e769bf46b0931d68031af212110484ec98e16908
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Nov 20 16:16:53 2018 -0600

    Tweak testsuite to issue FAIL for NaN, Inf (#279).
    
    Details:
    - Adjusted the definition for libblis_test_get_string_for_result() in
      testsuite/src/test_libblis.c so that the "FAIL" string is returned if
      the computed residual contains either NaN or Inf. Previously, a
      residual containing NaN would result in the selection of the "PASS"
      string. Thanks to Devin Matthews for reporting this issue (#279).
    - Expounded on comment for the macro definitions of bli_isnan() and
      bli_isinf() in bli_misc_macro_defs.h to make it more obvious why they
      must remain macros.
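    
    Sketch (illustrative only; these functions are hypothetical and are not
    the testsuite's actual code): one way a NaN residual can end up labeled
    "PASS" is a check that relies solely on an ordered comparison, since every
    ordered comparison against NaN evaluates to false.
    
    ```c
    #include <assert.h>
    #include <math.h>
    
    /* Hypothetical: a check that only compares the residual to a threshold. */
    static const char* sk_result_naive( double resid, double thresh )
    {
        assert( !isnan( thresh ) );
        /* ( NaN > thresh ) is false, so a NaN residual falls through to "PASS". */
        if ( resid > thresh ) return "FAIL";
        return "PASS";
    }
    
    /* Hypothetical: the same check with NaN and Inf screened out first. */
    static const char* sk_result_checked( double resid, double thresh )
    {
        assert( !isnan( thresh ) );
        if ( isnan( resid ) || isinf( resid ) ) return "FAIL";
        if ( resid > thresh ) return "FAIL";
        return "PASS";
    }
    ```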

commit 279deae18fb8b8106161863b46fcb38232314de4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Nov 16 11:34:19 2018 -0600

    Added 4x5 matlab plotting scripts to test/3m4m.
    
    Details:
    - Added a new directory, test/3m4m/matlab, containing matlab scripts for
      plotting 4x5 panels of performance graphs (using the subplot()
      function) for gemm, hemm, herk, trmm, and trsm across all four
      floating-point datatypes. I expect to further refine these scripts as
      time goes on, but their current state constitutes a good start.

commit 7b02c726650336c12286c8ba166d1d0fdf7601a8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 14 13:49:55 2018 -0600

    CREDITS file update.

commit 84dd298a27033945fa2d3b6e5dce1fe625cd2a0a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 14 13:47:45 2018 -0600

    Patch to fix msys2/Windows build failure (#277).
    
    Details:
    - Expanded cpp guard in frame/include/bli_x86_asm_macros.h to also check
      __MINGW32__ in addition to _WIN32, __clang__, and __MIC__. Thanks to
      Isuru Fernando for suggesting this fix, and also to Costas Yamin for
      originally reporting the issue (#277).

commit 8091998b6500e343c2024561c2b1aa73c3bafb0b
Merge: 333d8562 7b5ba731
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 14 12:36:35 2018 -0600

    Merge branch 'master' into amd

commit 7b5ba7319b3901ad0e6c6b4fa3c1d96b579efbe9
Merge: ce719f81 52392932
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 14 12:32:01 2018 -0600

    Merge branch 'dev' of github.com:flame/blis into dev

commit 52392932dc1ea3c16220cc4e6978efcb2f5f0616
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Nov 13 22:23:38 2018 +0000

    Minor fixes to test/3m4m drivers.
    
    Details:
    - Cleanups to Makefile to allow all test drivers to be built for
      OpenBLAS and MKL in addition to BLIS.
    - Fixed copy-paste typos in test_hemm in calls to ssymm_() and dsymm_().
    - Fixed incorrect types for betap in BLAS cpp macro branch of
      test_herk.c.

commit 4f12e36a0d0e6df146314b4e50e36c5e7a1af3d3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Nov 13 14:23:12 2018 -0600

    Fixed number of columns in first output line.
    
    Details:
    - In previous commit, forgot to remove output column corresponding to
      the k dimension.

commit a2e0cdd7debf8109198536d55af05d5631072fb2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Nov 13 14:15:11 2018 -0600

    Added hemm test driver to test/3m4m.
    
    Details:
    - Added a new test_hemm.c test driver to test/3m4m, which was modeled
      after the similarly named driver in test. Also updated Makefile
      so that blis-nat-[sm]t would trigger builds for the new driver.

commit 0f9b53e84b48d8d73a56cc9889eae3595ca58a78
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Nov 13 13:03:15 2018 -0600

    Fixed a bug in high-level mixeddt conditional.
    
    Details:
    - Fixed a bug in frame/3/bli_l3_oapi.c in the conditional that divides
      use of induced method (1m) execution from native execution. The former
      was intended to only be used in cases where all storage datatypes are
      complex and the datatype of C is equal to the computation datatype.
      (If mixed datatypes are detected, native execution would be used.)
      However, the code in bli_gemm() was erroneously checking the execution
      datatype instead of the computation datatype. At that point, the
      execution datatype is guaranteed to be equal to the storage datatype
      even if the computation datatype contains a different value. Thanks
      to Devangi Parikh for helping to isolate this bug.

commit 333d8562f04eea0676139a10cb80a97f107b45b0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Nov 11 14:28:53 2018 -0600

    Added debug output to bli_malloc.c.
    
    Details:
    - Added debug output to bli_malloc.c in order to debug certain kinds of
      memory behavior in BLIS. The printf() statements are disabled and must
      be enabled manually.
    - Whitespace/comment updates in bli_membrk.c.

commit ce719f816d1237f5277527d7f61123e77180be54
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Nov 10 14:48:43 2018 -0600

    More edits to mixeddt matlab scripts.
    
    Details:
    - Renamed scripts in test/mixeddt/matlab:
        plot_case_all.m -> plot_dom_all.m
        plot_case_md.m  -> plot_dom_case.m
        plot_all_md.m   -> plot_dt_all.m
    - Added plot_dt_select.m in order to plot select graphs for the main
      body of the mixeddt paper, and added additional related legend
      handling in plot_gemm_perf.m.
    - Added test/mixeddt/matlab/output and a .gitkeep file within in order
      to force git to recognize the directory.

commit bf99e7c14baf45725b698d06ad043b531e3a2763
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Nov 8 18:47:17 2018 -0600

    Minor updates to test/mixeddt driver.
    
    Details:
    - Cleaned up test/mixeddt Makefile in preparation for gathering new
      data for mixeddt paper, including renaming implementations to
      "internal" and "ad-hoc" to match the terminology to be used in the
      paper.
    - Added new matlab scripts for generating 8 figures, each covering all
      mixed-precision cases for each mixed-domain case.
    - Updated the runme.sh script according to changes to Makefile.
    - Fixed a minor bug in test_gemm.c that may have given incorrect
      performance in complex, homogeneous storage datatype cases where
      the computation precision was equal to the storage precisions.
      (Examples: zzzd, cccs.)

commit 4bbb454bf3c361af9e97bfa394a73d610cd9002a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Nov 3 19:11:01 2018 -0500

    Testsuite docs update for mixed-datatype gemm.
    
    Details:
    - Updated docs/Testsuite.md to include mention of the new mixed-domain
      and mixed-precision settings, including descriptions.
    - Updated docs/MixedDatatypes.md to include a brief section on running
      the testsuite to exercise mixed-datatype functionality, which mostly
      amounts to a link to the Testsuite.md document.
    - Minor verbiage change to testsuite output to correct a misleading
      label associated with the value returned by the query function
      bli_info_get_simd_num_registers(). (The function does not return the
      number of SIMD registers present in the hardware, but rather a maximum
      assumed value for the purposes of allocating temporary microtile
      workspace on the function stack.)

commit 16401ae922b1285437cf5f6867b2764650a95fb0
Merge: f19c33af 2d403a15
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Nov 3 19:09:43 2018 -0500

    Merge branch 'dev'

commit 2d403a1535380a2ebe2ae2c0f5ac54ba7564fbeb
Merge: e90e7f30 4a12979f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Nov 1 20:18:53 2018 -0500

    Merge pull request #275 from RhysU/patch-1
    
    Spelling in FAQ

commit 4a12979f65697ed79ba290efd59f4b994ac9429b
Author: Rhys Ulerich <rhys.ulerich@gmail.com>
Date:   Thu Nov 1 20:20:59 2018 -0400

    Spelling in FAQ

commit f19c33af4cbe6f5705b96fbf2b8799c3c2bd75c3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 26 17:07:15 2018 -0500

    Disallow 64b BLAS integers + 32b BLIS integers.
    
    Details:
    - Print an error message from configure if the user attempts to
      explicitly configure BLIS for simultaneous use of 64-bit integers in
      the BLAS API with 32-bit integers in the BLIS API.
    - Added cpp macro conditional to bli_type_defs.h to mandate that BLIS
      integers be 64 bits if the BLAS integers are 64 bits. This and the
      above item take care of issue #274. Thanks to Devin Matthews and
      Jeff Hammond for suggesting these safeguards.
    - Slight reorganization and relabeling (for clarity) of BLAS/CBLAS
      sections and BLIS integer size line of the testsuite configuration
      output.
    - Very minor edits to docs/MixedDatatypes.md.

commit e90e7f309b3f2760a01e8e09a29bf702754fa2b5 (origin/win-pthreads)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 25 14:09:43 2018 -0500

    CHANGELOG update (0.5.0)

commit be7c57819cfd48adb175d9a480cc9f37928645c1 (tag: 0.5.0)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 25 14:09:40 2018 -0500

    Version file update (0.5.0)

commit 75da7f2a208ad7d26ed9c6d3e10d08b2a1caf9d6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 25 14:02:41 2018 -0500

    ReleaseNotes.md update in advance of next version.
    
    Details:
    - Updated ReleaseNotes.md in preparation for next version.
    - Updated docs/FAQ.md to reflect recent developments, and other edits.
    - Minor updates to RELEASING.

commit 6fbc456fb3f4401ec951a618990f15a84fdfa236
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 25 13:20:25 2018 -0500

    Added SALT testing to Travis CI.
    
    Details:
    - Modified .travis.yml to automatically employ the simulation of
      application-level threading within the testsuite, with supporting
      changes to common.mk, the top-level Makefile, and
      travis/do_testsuite.sh.
    - Added a new pair of input files to testsuite directory with the
      '.salt' suffix (similar to those with the '.fast' suffix) for
      testing application-level threading.
    - Updated docs/BuildSystem.md to document the new make targets
      'testblis-salt' and 'checkblis-salt'.

commit 0e27963a6770e6b64f3299ad0613d5df45d8b6ae
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 24 12:16:19 2018 -0500

    Add bli_pthread_mutex_trylock().
    
    Details:
    - Added the missing bli_pthread_mutex_trylock() function and prototype
      to the non-Windows sections of bli_pthread.c and .h. This function
      isn't needed by BLIS, but I figured why not make the Windows and
      non-Windows sections consistent with one another.

commit 4b683740c12f83804a51ec610b16ce28607d5c85
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 24 11:56:16 2018 -0500

    Defined bli_pthread_cond_*() and related defs.
    
    Details:
    - Added function definitions for bli_pthread_cond_*() as well as related
      types and constants to bli_pthread.c, and corresponding prototypes to
      bli_pthread.h.

commit 4b4f8072b9bb495b3e01d45698b0bad3dac31ba8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 24 11:31:46 2018 -0500

    Define bli_pthreads barrier types on OS X.
    
    Details:
    - Fully define bli_pthreads barrier-related types on OS X. Only typedef
      those types in terms of pthreads types on non-Windows, non-Apple OSes
      (i.e. Linux).

commit ad98790dcef6bd9aab7f13d615b987b5daa58757
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Oct 23 20:35:05 2018 -0500

    Fix names of Windows pthread initializer macros.
    
    Details:
    - Renamed the PTHREAD_ initializer macros in the Windows cpp case to use
      BLIS_ prefixes to match their non-Windows counterparts.

commit 06c23954e6b17219a50c3d37821544a46defaf89
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Oct 23 19:16:54 2018 -0500

    Defined unified bli_pthread_*() API for all OSes.
    
    Details:
    - Expanded the bli_pthread_*() -> pthread_*() wrappers in
      frame/thread/bli_pthread.c to include cases for Windows taken from
      frame/base/bli_pthread_wrap.c. Now, bli_pthread_*() is always defined
      and always used by BLIS and the BLIS testsuite (in lieu of calling
      pthreads directly, as before). The implementation used in this new
      API depends on whether we are building for Windows, and to a lesser
      extent, whether we are building on OS X. For the core API, Windows
      uses Windows threads, non-Windows (Linux, OS X) uses pthreads.
      OS X and Windows get barriers implemented in terms of other
      bli_pthread_*() functions, and Linux gets barriers implemented in
      terms of pthread_barrier*(). This commit addresses issue #273.
    - Fixed a bug in the Linux definition of bli_pthread_mutex_unlock(),
      which was erroneously calling pthread_mutex_lock().
    - Minor changes to configure so that the auto-detection executable
      can be built given the above changes (most notably, turning on
      POSIX extensions via -D_GNU_SOURCE).
    - Removed temporary play-test code for shiftd that accidentally got
      committed into test/3m4m/test_gemm.c.

commit 0ae9585da1e3db1cf8034d4b16305a5883beb0d3
Author: pradeeptrgit <pradeep.rao@amd.com>
Date:   Tue Oct 23 09:36:23 2018 +0530

    Update version number to 1.2
    
    Change-Id: Ibb31f6683cdecca6b218bc2f0c14701d7e92ebf3

commit eac7d267a017d646a2c5b4fa565f4637ebfd9da7
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 22 18:10:59 2018 -0500

    Unconditionally define bli_l3_thread_entry().
    
    Details:
    - Define a dummy bli_l3_thread_entry() function when multithreading is
      disabled altogether, or enabled via OpenMP. This function was
      originally necessary when multithreading is enabled via pthreads.
      By defining the function no matter the threading options given, it is
      less likely that an AppVeyor Windows build will complain due to a
      missing symbol in the DLL. (To be clear: AppVeyor was working fine
      before, but a problem may have arisen if it were switched to an
      OpenMP build.)
    - Removed the prototype for bli_l3_thread_entry() from
      bli_thrcomm_pthreads.c and placed it in bli_thrcomm.h.
    - Regenerated the symbols list file build/libblis-symbols.def.

commit 4ee986f0a74207f4ca29df077929134725d62b80
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 22 14:09:44 2018 -0500

    Added mixed-datatype testing to Travis CI (#271).
    
    Details:
    - Modified .travis.yml to automatically test the mixed-datatype support
      of the gemm operation, with supporting changes to common.mk, the
      top-level Makefile, and travis/do_testsuite.sh.
    - Added a new pair of input files to testsuite directory with the
      '.mixed' suffix (similar to those with the '.fast' suffix) for testing
      mixed-datatype gemm.
    - Updated docs/BuildSystem.md to document the new make targets
      'testblis-md' and 'checkblis-md'.

commit c3c6ebc9c6244053d654a9b0c955acb2fef42ee8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Oct 21 18:48:54 2018 -0500

    Fixed thrinfo_t printing for small problems.
    
    Details:
    - Fixed a bug in the code that prints out the communicator and work ids
      from the various threads' thrinfo_t nodes. This bug manifested when
      the dimension being parallelized was not large enough such that every
      thread was assigned actual work (since the minimum amount of work is
      determined by the register blocksize in the dimension being
      parallelized). In those cases, the threads that receive no work in
      that dimension do not finish building their thrinfo_t tree, leaving
      lower-level nodes non-existent. (The bug itself was usually observed as
      a segfault when the printing code attempted to dereference all the way
      down the thrinfo_t tree.) The solution involves explicitly checking
      each node as it is dereferenced, and if at any time NULL is found, all
      subsequent communicator and work ids are set to -1.
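    
    Sketch (illustrative only; sk_thrinfo_t is a simplified, hypothetical
    stand-in for thrinfo_t, and the field names are assumptions): the
    NULL-checking walk described above might look roughly like this.
    
    ```c
    #include <assert.h>
    #include <stddef.h>
    
    /* Hypothetical, simplified stand-in for a thrinfo_t node. */
    typedef struct sk_thrinfo_s
    {
        int                  comm_id;
        int                  work_id;
        struct sk_thrinfo_s* sub_node;  /* NULL if the subtree was never built */
    } sk_thrinfo_t;
    
    /* Record the ids for n_levels levels of the tree. Once a NULL node is
       reached, that level and all levels below it are reported as -1. */
    static void sk_collect_ids( const sk_thrinfo_t* node, int n_levels,
                                int* comm_ids, int* work_ids )
    {
        assert( comm_ids != NULL && work_ids != NULL );
    
        for ( int i = 0; i < n_levels; ++i )
        {
            if ( node != NULL )
            {
                comm_ids[ i ] = node->comm_id;
                work_ids[ i ] = node->work_id;
                node = node->sub_node;
            }
            else
            {
                comm_ids[ i ] = -1;
                work_ids[ i ] = -1;
            }
        }
    }
    ```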

commit 73a222c0d99dcc221be7dea10eaebf844f31f72e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Oct 20 14:13:04 2018 -0500

    Minor edits to 'configure --help' text.

commit 14f3d5e6df183819a0c393b2661ad15df0786544
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 19 20:39:35 2018 -0500

    Refresh libblis-symbols.def post-merge 090e4f0.

commit 090e4f08fc2f429a1b2db77b0a6f8276f892a7ac
Merge: c9be5889 0854e880
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 19 18:41:10 2018 -0500

    Merge branch 'master' into dev

commit 0854e880b0848e0c2e3d0644c93c80b0fd13c0dc
Merge: 4e38a8d4 343a2715
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 19 18:05:00 2018 -0500

    Merge pull request #261 from flame/win-pthreads
    
    Implement missing pthreads function on Windows

commit c9be5889fbe947c64ef75740662e4d63032f4c35
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 19 17:42:40 2018 -0500

    Added "Known issues" section to Multithreading.md.
    
    Details:
    - Added known issues section to Multithreading.md.
    - Trivial changes to MixedDatatypes.md, Sandboxes.md.

commit 343a2715ebee28d250ee41b914abdcd1dc77c344
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 19 16:59:19 2018 -0500

    Whitespace changes to configure, bli_pthread_wrap.
    
    Details:
    - Mostly whitespace changes (spaces to tabs) to configure and
      bli_pthread_wrap.c and .h.

commit 3678a1cd518df9447b4b1ea86885eb2ba8abcf6e
Merge: 85397cd4 4e38a8d4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 19 16:11:31 2018 -0500

    Merge branch 'master' into win-pthreads

commit 4e38a8d4eebb18ead74e644fac76a4fde8e7f6c6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 19 15:54:15 2018 -0500

    Implemented python version checking in configure.
    
    Details:
    - Added python version checking to configure script. (Recall that python
      is needed to execute the flatten-headers.py script.) Minimum versions
      of python needed are currently as follows:
        python2: 2.7 or later
        python3: 3.5 or later
      The standard search order for python interpreters is:
        python python3 python2
      The PYTHON environment variable is also supported and will be checked
      before the standard search order list.
    - Updated BuildSystem.md to include: a minimum make version; mention
      that the C compiler must actually be a C99 compiler; and the caveat
      that Windows builds do not require pthreads since BLIS can provide
      an implementation of pthreads internally.

commit 85397cd4fa52f6c4c33f4fb715478c55533c680e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 19 13:12:43 2018 -0500

    Added explanatory comment to bli_pthread.c.
    
    Details:
    - Added a verbose comment to bli_pthread.c that explains why a bli_
      wrapper to pthreads APIs is useful.

commit 53c07035ef61cc9b8469636d4d8fa5085f37652d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 19 12:53:03 2018 -0500

    Refresh libblis-symbols.def from bb6df28.
    
    Details:
    - Forgot to regenerate the symbols file after the previous commit
      (bb6df281) in which shiftd operation was introduced.

commit 473ce54f5fbea4860ac0514e7e8b022c1ea03e63
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 18 19:03:56 2018 -0500

    Added bli_pthread_*() API.
    
    Details:
    - Defined a bli_pthread_*() API so that the testsuite, when being linked
      against a Windows DLL, will be able to access pthreads functionality
      without those pthreads functions being explicitly exported by the DLL.
      Instead, we export the bli_pthread_*() layer, which uses types and
      functions that are identical to pthreads, but adds a 'bli_' prefix.
      Only a few basic functions are present in the bli_pthread_*() API
      for now. Thanks to Devin Matthews and Isuru Fernando for their help
      on a related PR (#261) that this commit will hopefully facilitate.
    - Updated testsuite so that it calls bli_pthread_*() layer instead of
      pthread_*() functions directly.
    - Regenerated build/libblis-symbols.def.
    - Updated comment in build/regen-symbols.sh.

commit bb6df2814fcaa2fa62a549379f61be2f8667a598
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 18 17:11:39 2018 -0500

    Defined a new level-1d operation: shiftd.
    
    Details:
    - Defined a new level-1d operation called 'shiftd', including object and
      typed APIs. This operation adds a scalar value to every element along
      an arbitrary diagonal of a matrix. Currently, shiftd is implemented in
      terms of the addv kernel. (The scalar is passed in as the x vector
      with an increment of zero.)
    - Replaced ad-hoc usage of setd and addd (after creating a temporary
      matrix object) with use of shiftd, which is much more concise, in
      various test driver files in the testsuite. Similar changes were made
      to the standalone test drivers and the example code.
    - Added documentation entries in BLISObjectAPI.md and BLISTypedAPI.md
      for bli_shiftd() and bli_?shiftd(), respectively.
    - Added observed object properties to level-1d documentation in
      BLISObjectAPI.md.
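    
    Sketch (illustrative only; sk_daddv() and sk_dshiftd() are hypothetical
    and do not reproduce the BLIS kernels or APIs, and only the main diagonal
    is handled): the zero-increment trick mentioned above can be shown with a
    plain addv-style loop.
    
    ```c
    #include <assert.h>
    #include <stddef.h>
    
    /* Hypothetical addv-style kernel: y := y + x, with strides incx, incy. */
    static void sk_daddv( size_t n, const double* x, ptrdiff_t incx,
                          double* y, ptrdiff_t incy )
    {
        for ( size_t i = 0; i < n; ++i )
            y[ ( ptrdiff_t )i * incy ] += x[ ( ptrdiff_t )i * incx ];
    }
    
    /* Add alpha to the main diagonal of an m-by-n matrix stored with row
       stride rs and column stride cs. */
    static void sk_dshiftd( size_t m, size_t n, double alpha,
                            double* a, ptrdiff_t rs, ptrdiff_t cs )
    {
        assert( a != NULL );
    
        size_t n_diag = ( m < n ? m : n );
    
        /* With incx = 0, every element read from x is the same scalar.
           Consecutive diagonal elements are rs + cs apart in memory. */
        sk_daddv( n_diag, &alpha, 0, a, rs + cs );
    }
    ```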

commit 53e0a0c9b38e8525c7224e280342ef56328af567
Merge: 1c7247b6 ec676799
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 18 14:54:59 2018 -0500

    Merge branch 'master' into win-pthreads

commit ec67679990660a60362a49406595383672812287
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 18 14:27:02 2018 -0500

    Refreshed Windows symbol list; added regen script.
    
    Details:
    - Moved windows/build/libblis-symbols.def to build/libblis-symbols.def.
      Updated link commands in common.mk accordingly.
    - Added a new script build/regen-symbols.sh that will regenerate the
      libblis-symbols.def file in its new location after building a
      haswell-targeted shared library. Thanks to Isuru Fernando for
      providing the symbol generation command.
    - Ran the new script to refresh the symbols file.

commit fdad54ab8eee4a7efd04ec4afb3e6902eb22e60a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 18 12:43:22 2018 -0500

    Removed old symbol from libblis-symbols.def.
    
    Details:
    - Removed bli_gemm_ker_var1() from windows/build/libblis-symbols.def
      since this function is no longer compiled.

commit 49d3f9fcbb4a75553439f97c099ea48d85763eea
Merge: 779d64dc 3c527256
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 17 18:00:40 2018 -0500

    Merge branch 'master' into dev

commit 3c52725693d0d7726e1c8fb224f9b1ef786db8b9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 17 14:56:22 2018 -0500

    Renamed/moved l3 zen ukernels to haswell kernel set.
    
    Details:
    - Renamed the microkernels in kernels/zen/3 to kernels/haswell/3 and
      then updated the file contents to use the 'haswell' infix.
    - Updated bli_cntx_init_zen.c and bli_cntx_init_haswell.c according to
      above function renames.
    - Moved/updated the corresponding prototypes in bli_kernels_zen.h to
      bli_kernels_haswell.h.
    - Updated config_registry according to above changes.
    - NOTE: This rename reflects the fact that haswell microkernels are
      specifically written to overcome the floating-point latency for FMA
      instructions on Intel Haswell-like architectures, which can issue two
      FMA instructions per cycle. These ukernels happen to work fine on AMD
      Zen-based architectures. However, Zen only issues one FMA per cycle,
      which, while halving its floating-point throughput, gives it extra
      flexibility in the design of its microkernels--namely, mr and nr can
      be smaller and still overcome the floating-point latency for those
      single-issue cores. A smaller value of mr and nr allows for a larger
      value of kc, which may be useful in some situations. In the future,
      we may write such Zen-specific microkernels to take advantage of this
      additional flexibility.
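    
    Sketch (illustrative only; the helper is hypothetical and the rule of
    thumb is an assumption of this sketch, not a statement taken from the
    kernels): to hide FMA latency, a microkernel needs roughly as many
    independent accumulator registers as the FMA latency in cycles multiplied
    by the number of FMA instructions issued per cycle. Since the accumulator
    count grows with mr and nr, halving the issue rate halves that minimum.
    
    ```c
    #include <assert.h>
    
    /* Hypothetical helper: minimum number of independent accumulators needed
       to keep the FMA pipeline(s) busy. */
    static unsigned sk_min_accumulators( unsigned fma_latency,
                                         unsigned fma_per_cycle )
    {
        assert( fma_latency > 0 && fma_per_cycle > 0 );
        return fma_latency * fma_per_cycle;
    }
    ```
    
    For a placeholder latency of L cycles, a dual-issue core needs 2L
    accumulators while a single-issue core needs only L.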

commit 71c5832d5f5596f25204980803423d08143a4010
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 17 14:11:01 2018 -0500

    Consolidated slab/rr-explicit level-3 macrokernels.
    
    Details:
    - Consolidated the *sl.c and *rr.c level-3 macrokernels into a single
      file per sl/rr pair, with those files named as they were before
      c92762e. The consolidation does not take away the *option* of using
      slab or round-robin assignment of micropanels to threads; it merely
      *hides* the choice within the definitions of functions such as
      bli_thread_range_jrir(), bli_packm_my_iter(), and bli_is_last_iter()
      rather than expose that choice explicitly in the code. The choice of
      slab or rr is not always hidden, however; there are some cases
      involving herk and trmm, for example, that require some part of the
      computation to use rr unconditionally. (The --thread-part-jrir option
      controls the partitioning in all other cases.)
    - Note: Originally, the sl and rr macrokernels were separated out for
      clarity. However, aside from the additional binary code bloat, I later
      deemed that clarity not worth the price of maintaining the additional
      (mostly similar) codes.
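    
    Sketch (illustrative only; these functions are hypothetical and are not
    the definitions of bli_thread_range_jrir() or its relatives): hiding the
    slab/round-robin choice behind a range function could look roughly like
    this, with both variants returning a start, end, and increment.
    
    ```c
    #include <assert.h>
    #include <stddef.h>
    
    /* Hypothetical: slab partitioning gives each thread one contiguous range. */
    static void sk_range_slab( size_t n_iter, size_t n_thr, size_t tid,
                               size_t* start, size_t* end, size_t* inc )
    {
        assert( n_thr > 0 && tid < n_thr );
    
        size_t base = n_iter / n_thr;
        size_t rem  = n_iter % n_thr;
    
        /* The first rem threads each take one extra iteration. */
        *start = tid * base + ( tid < rem ? tid : rem );
        *end   = *start + base + ( tid < rem ? 1 : 0 );
        *inc   = 1;
    }
    
    /* Hypothetical: round-robin gives thread tid the iterations
       tid, tid + n_thr, tid + 2*n_thr, and so on. */
    static void sk_range_rr( size_t n_iter, size_t n_thr, size_t tid,
                             size_t* start, size_t* end, size_t* inc )
    {
        assert( n_thr > 0 && tid < n_thr );
    
        *start = tid;
        *end   = n_iter;
        *inc   = n_thr;
    }
    ```
    
    Either way, the calling loop keeps a single form,
    for ( i = start; i < end; i += inc ), so the macrokernel body does not
    need to know which scheme was chosen.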

commit 57eab3a4f0e43099fc2ff189df9fcc0d7801c2cd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 17 11:29:20 2018 -0500

    CREDITS file update.

commit 6722ec21817cbab9d86ee63f00984eb407b5e627
Author: Ye Luo <xw111luoye@gmail.com>
Date:   Wed Oct 17 11:26:00 2018 -0500

    Fix bgclang compilation on BGQ (#270)
    
    * Fix bgq kernels
    
    * Support bgq with bgclang

commit 1c7247b6d146fc728d7c4240e4e069e33f8f8868
Merge: c1bc5530 6c5a1aaf
Author: Devin Matthews <damatthews@smu.edu>
Date:   Tue Oct 16 14:44:32 2018 -0500

    Merge branch 'win-pthreads' of github.com:flame/blis into win-pthreads

commit c1bc5530d51bf55b4aa3c35165f6d4452a0fd779
Author: Devin Matthews <damatthews@smu.edu>
Date:   Tue Oct 16 14:44:10 2018 -0500

    Don't call pthread_once in auto-detect.

commit b9c61d03f542a2e92551ff0595415bec3076ab25
Merge: 5a1e461f 3612ecac
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Oct 16 14:39:57 2018 -0500

    Merge branch 'nested-omp-patch'

commit 5a1e461ffe09ed200ee2fc7aafccf6dd7e8c0080
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Oct 16 14:21:45 2018 -0500

    Execute flatten-headers.py via $(PYTHON).
    
    Details:
    - Execute build/flatten-headers.py python script via $(PYTHON) in
      common.mk. This allows distributions that define the current/preferred
      python interpreter in the PYTHON environment variable to use that
      interpreter when executing flatten-headers.py. Thanks to Isuru
      Fernando for this suggestion, and to Dave Love for submitting the
      initial issue/request.

commit 6c5a1aaff540b19672e91501e894ed695aee322b
Author: Devin Matthews <damatthews@smu.edu>
Date:   Tue Oct 16 10:15:59 2018 -0500

    Fix type in bli_pthread_wrap.c

commit 29e6245816760b1bd4ac738d7d3e11a9d9d13473
Merge: 0b73209f ed657714
Author: Devin Matthews <damatthews@smu.edu>
Date:   Tue Oct 16 10:12:25 2018 -0500

    Merge branch 'master' into win-pthreads

commit 0b73209f6b22cc024169146d343627f6999b63d8
Author: Devin Matthews <damatthews@smu.edu>
Date:   Tue Oct 16 10:02:06 2018 -0500

    Add missing argument to WaitForSingleObject and use $is_win in configure
    to turn off pthreads.

commit ed65771482a705f7ed028d822489766327b44e76
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 15 17:54:45 2018 -0500

    Fixed merge fail on testsuite threading macros.
    
    Details:
    - Applied the following C preprocessor macro renames
    
        BLIS_DEFAULT_MR_THREAD_MAX  -> BLIS_THREAD_MAX_IR
        BLIS_DEFAULT_NR_THREAD_MAX  -> BLIS_THREAD_MAX_JR
        BLIS_DEFAULT_M_THREAD_RATIO -> BLIS_THREAD_RATIO_M
        BLIS_DEFAULT_N_THREAD_RATIO -> BLIS_THREAD_RATIO_N
    
      in src/test_libblis.c. This is apparently the result of a failure by
      git to properly merge the 'master' and 'amd' branches in the previous
      commit. (The 'master' branch contained a commit, 53a9ab1, in which
      these same cpp macros were renamed throughout the source distribution.)

commit dc5fd898af8c74c2e2a75fc647157da0d04dd922
Merge: 667d3929 637c2ce7
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 15 17:41:35 2018 -0500

    Merge branch 'amd'

commit 779d64dc3091dea6b7530283304e52878151d218
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 15 17:13:18 2018 -0500

    Added entry for xpbym to input.operations.fast.
    
    Details:
    - Forgot to add an entry for the new xpbym operation to
      input.operations.fast in previous commit.

commit 5fec95b99f61761963834f62a9867f797687813c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 15 16:37:39 2018 -0500

    Implemented mixed-datatype support for gemm.
    
    Details:
    - Implemented support for gemm where A, B, and C may have different
      storage datatypes, as well as a computational precision (and implied
      computation domain) that may be different from the storage precision
      of either A or B. This results in 128 different combinations, all of
      which are implemented within this commit. (For now, the mixed-datatype
      functionality is only supported via the object API.) If desired, the
      mixed-datatype support may be disabled at configure-time.
    - Added a memory-intensive optimization to certain mixed-datatype cases
      that requires a single m-by-n matrix be allocated (temporarily) per
      call to gemm. This optimization aims to avoid the overhead involved in
      repeatedly updating C with general stride, or updating C after a
      typecast from the computation precision. This memory optimization may
      be disabled at configure-time (provided that the mixed-datatype
      support is enabled in the first place).
    - Added support for testing mixed-datatype combinations to testsuite.
      The user may test gemm with mixed domains, precisions, both, or
      neither.
    - Added a standalone test driver directory for building and running
      mixed-datatype performance experiments.
    - Defined a new variation of castm, castnzm, which operates like castm
      except that imaginary values are not touched when casting a real
      operand to a complex operand. (By contrast, in these situations castm
      sets the imaginary components of the destination matrix to zero.)
    - Defined bli_obj_imag_is_zero() and substituted calls in lieu of all
      usages of bli_obj_imag_equals() that tested against BLIS_ZERO, and
      also simplified the implementation of bli_obj_imag_equals().
    - Fixed bad behavior from bli_obj_is_real() and bli_obj_is_complex()
      when given BLIS_CONSTANT objects.
    - Disabled dt_on_output field in auxinfo_t structure as well as all
      accessor functions. Also commented out all usage of accessor
      functions within macrokernels. (Typecasting in the microkernel is
      still feasible, though probably unrealistic for now given the
      additional complexity required.)
    - Use void function pointer type (instead of void*) for storing function
      pointers in bli_l0_fpa.c.
    - Added documentation for using gemm with mixed datatypes in
      docs/MixedDatatypes.md and example code in examples/oapi/11gemm_md.c.
    - Defined level-1d operation xpbyd and level-1m operation xpbym.
    - Added xpbym test module to testsuite.
    - Updated frame/include/bli_x86_asm_macros.h with additional macros
      (courtesy of Devin Matthews).
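
    Illustration (not BLIS code): the difference between castm and castnzm
    described above, when a real operand is cast to a complex operand, can
    be sketched as follows. The zcomplex_t type and both function names are
    hypothetical stand-ins chosen for this sketch; they are not the
    library's API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical complex element type used only for this sketch. */
typedef struct { double real, imag; } zcomplex_t;

/* castm-like behavior: copy the real values into the destination and
   set the destination's imaginary components to zero. */
static void cast_real_to_complex(const double *x, zcomplex_t *y, size_t n)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; ++i) {
        y[i].real = x[i];
        y[i].imag = 0.0;
    }
}

/* castnzm-like behavior: copy the real values but leave whatever is
   already stored in the destination's imaginary components untouched. */
static void cast_real_to_complex_nz(const double *x, zcomplex_t *y, size_t n)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; ++i) {
        y[i].real = x[i];
    }
}
```

    The "nz" variant is the one to reach for when the destination already
    holds meaningful imaginary values that must survive the cast.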

commit 3612ecac98a9d36c3fcd64154121d420bb69febd (origin/nested-omp-patch)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 11 15:16:41 2018 -0500

    Added comments to nested OpenMP handling code.
    
    Details:
    - Added comments to bli_thrcomm_openmp.c relating to changes made in
      6ac0c80 and 1064d79.

commit 667d3929ee20e94849b4e25b693b4037b7e3f350
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 11 11:47:57 2018 -0500

    Added Fortran APIs for some thread functions.
    
    Details:
    - Defined Fortran-77 compatible APIs for bli_thread_set_num_threads()
      and bli_thread_set_ways(). These wrappers are defined in
      frame/compat/blis/thread/b77_thread.c. Thanks to Kay Dewhurst for
      suggesting these new interfaces.
    - Added missing prototype for bli_thread_set_ways() in bli_thread.h and
      removed prototypes for non-existent functions bli_thread_set_*_nt().
    - CREDITS file update.

commit 1064d79711f03a0541b92d8b8b9b7e25e04097a5
Author: Devin Matthews <damatthews@smu.edu>
Date:   Thu Oct 11 11:14:25 2018 -0500

    Adjust rntm_t struct as well.

commit 6ac0c805609b85616ddb32e50101c4f9feb25a35
Author: Devin Matthews <damatthews@smu.edu>
Date:   Thu Oct 11 10:45:07 2018 -0500

    Fix OMP nesting problem.
    
    Detect when OpenMP uses fewer threads than requested and correct accordingly, so that we don't wait forever for nonexistent threads. Fixes #267.

commit 78a6935483409ae277c766406e175772e820b1de
Author: sraut <Biplab.Raut@amd.com>
Date:   Thu Oct 11 10:49:40 2018 +0530

    Added comments for the syrk small matrix change.
    
    Change-Id: I958939e9953323730da49ef07d1b10e578837d82

commit 53a9ab1c85be14dcfd2560f5b16e898e3e258797
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 10 15:11:09 2018 -0500

    Renamed thread auto-factorization macro constants.
    
    Details:
    - Renamed the following C preprocessor macros whose fallback/default
      values are specified within frame/include/bli_kernel_macro_defs.h:
    
        BLIS_DEFAULT_MR_THREAD_MAX  -> BLIS_THREAD_MAX_IR
        BLIS_DEFAULT_NR_THREAD_MAX  -> BLIS_THREAD_MAX_JR
        BLIS_DEFAULT_M_THREAD_RATIO -> BLIS_THREAD_RATIO_M
        BLIS_DEFAULT_N_THREAD_RATIO -> BLIS_THREAD_RATIO_N
    
    - Renamed the above cpp macro overrides within the knl, skx, and zen
      sub-configurations, as well as invocations of those macros in
      bli_rntm.c.
    - Moved config/zen/bli_kernel.h to an 'old' directory as it is no longer
      used by any code within BLIS.

commit 637c2ce794b0414ba8b25e9a452f7d64f825d63a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Oct 9 17:18:04 2018 -0500

    Updated column index range for irun.py -q.
    
    Details:
    - Forgot to apply the column index range fix in 10f179f to situations
      when "quiet" mode (-q) is requested. This commit applies the new
      column index range modifications to the quiet case.

commit e2a59400bdda7ed7ee0ff00edea70c00ed593b6c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Oct 9 15:29:48 2018 -0500

    Allow trsm_l parallelism in the jc loop.
    
    Details:
    - Previously, trsm was consolidating all ways of parallelism into the jr
      loop. This was unnecessary and to some degree detrimental on some
      types of hardware. Now, any parallelism bound for the jc loop will be
      applied to the jc loop, while all other loops' parallelism is funneled
      to the jr loop. Thanks to Devangi Parikh for helping investigate this
      issue and suggesting the fix.
    - NOTE: This change affects only left-side trsm. However,
      right-side trsm is currently implemented in terms of the left-side
      case, and thus the change effectively applies to both left and right
      cases.

commit f1dba506c970f14e612580d3c171e7c5ffd0a5fb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 8 17:59:41 2018 -0500

    Output threading status/params from testsuite.
    
    Details:
    - Updated testsuite to output various parameters related to parallelism
      in BLIS. These parameters include:
      - threading status: disabled, openmp, or pthreads;
      - thread partitioning for jr/ir loops: slab or rr (round-robin);
      - ways of parallelism from environment variables, and also actual
        values used by gemm, herk, trmm_l, trmm_r, trsm_l, and trsm_r for
        square problems (assuming all dimensions are set to 1000);
      - automatic thread factorization parameters.
    - Also output the status of two relatively new configure-time options:
      libmemkind and the sandbox.

commit 10f179fb13fc1179921a4ef8efdd2174f01e07da
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 8 14:36:38 2018 -0500

    Updated irun.py to use updated column index range.
    
    Details:
    - Updated the irun.py script so that it updates the matlab column index
      range (if found) to reflect the additional columns of data that are
      substituted in. Thanks to Devangi Parikh for recognizing and reporting
      this issue.

commit c244a716c97849dee41f52b5f424116aae1b710b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Oct 7 20:59:40 2018 -0500

    Added missing -r option to configure --help output.
    
    Details:
    - Added inadvertently omitted mention of the -r option (equivalent to
      --thread-part-jrir) to the output for 'configure --help'. Also made
      minor edits to the same text.

commit c92762ecdca1eb0b08c8acd583b4739a1e3fbd39
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Oct 7 20:30:32 2018 -0500

    Added option of slab or rr partitioning in jr/ir.
    
    Details:
    - Updated existing macrokernel function names and definitions to
      explicitly use slab assignment of micropanels to threads, then created
      duplicate versions of macrokernels that explicitly use round-robin
      assignment instead of slab. NOTE: As in ac18949, trsm_r macrokernels
      were not substantially updated in this commit because they are
      currently disabled in bli_trsm_front.c.
    - Updated existing packing function (in bli_packm_blk_var1.c) to
      explicitly use slab partitioning, and then duplicated for round-robin.
    - Updated control tree initialization to use the appropriate macrokernel
      and packm function pointers depending on which method (slab or rr) was
      enabled at configure-time.
    - Updated configure script to accept new --thread-part-jrir=[slab|rr]
      option (-r [slab|rr] for short), which allows the user to explicitly
      request either slab or round-robin assignment (partitioning) of
      micropanels to threads.
    - Updated sandbox/ref99 according to above changes.
    - Minor updates to build/add-copyright.py.
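
    Illustration (not BLIS code): the two ways of assigning micropanels to
    threads named above can be sketched as index ranges. The range_t type
    and the part_slab()/part_rr() helpers are hypothetical, and the way
    leftover iterations are distributed in the slab case is an assumption
    of this sketch rather than a description of the macrokernels.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of the iterations owned by one thread:
   indices start, start+inc, start+2*inc, ... strictly below end. */
typedef struct { size_t start, end, inc; } range_t;

/* Slab: thread tid owns one contiguous chunk of the n iterations.
   In this sketch the first (n % nt) threads get one extra iteration. */
static range_t part_slab(size_t n, size_t nt, size_t tid)
{
    assert(nt > 0 && tid < nt);
    size_t base  = n / nt;
    size_t extra = n % nt;
    size_t start = tid * base + (tid < extra ? tid : extra);
    size_t len   = base + (tid < extra ? 1 : 0);
    range_t r = { start, start + len, 1 };
    return r;
}

/* Round-robin: thread tid owns iterations tid, tid+nt, tid+2*nt, ... */
static range_t part_rr(size_t n, size_t nt, size_t tid)
{
    assert(nt > 0 && tid < nt);
    range_t r = { tid, n, nt };
    return r;
}
```

    A thread would then loop as
    "for (i = r.start; i < r.end; i += r.inc)". With 10 micropanels and 4
    threads, slab gives thread 0 the panels 0..2, while round-robin gives
    thread 1 the panels 1, 5, and 9.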

commit 98e01ea04bfe1032e5bd4781043afd84f864a19e
Merge: ac18949a 541b8a3b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 4 20:44:12 2018 -0500

    Merge branch 'master' into amd

commit 541b8a3b3e9af4078f5e6fb2f9608d681839952a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 4 20:39:06 2018 -0500

    Removed 1h short-circuit from bli_clock_min_diff().
    
    Details:
    - Removed a guard from bli_clock_min_diff() that would return 0 if the
      time delta was greater than 60 minutes. This was originally intended
      to disregard extremely large values under the assumption that the
      user probably didn't intend to run a test that long. However, since
      it is in bli_clock_min_diff(), it doesn't actually help short-circuit
      an implementation that is hanging or looping infinitely, since such
      an implementation would first have to finish before the
      bli_clock_min_diff() is called. Thanks to Kiran Varaganti for
      reporting this issue.
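
    Illustration (not the actual bli_clock.c source): a running-minimum
    update without the one-hour guard might look like the sketch below.
    The min_diff() name, its signature (which takes the measured delta
    directly instead of reading the clock), and the exact thresholds of the
    remaining sanity checks are assumptions made for this example.

```c
#include <assert.h>

/* Returns the new running minimum given the previous minimum and the
   latest measured time delta, both in seconds. */
static double min_diff(double time_min, double time_diff)
{
    /* The running minimum is assumed to start from a large positive
       sentinel value. */
    assert(time_min > 0.0);

    double candidate = (time_diff < time_min) ? time_diff : time_min;

    /* Keep the previous minimum when the new value is non-positive or
       below one nanosecond, since such readings are treated as garbage. */
    if (candidate <= 0.0 || candidate < 1.0e-9)
        return time_min;

    /* No upper bound: a delta longer than an hour is a valid reading. */
    return candidate;
}
```

    With this shape, min_diff(1.0e9, 4000.0) yields 4000.0, whereas a
    guard rejecting deltas above 3600 seconds would have discarded it.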

commit f0c3ef359f7c6c1687fb2671cb35deb346e00597
Author: Kiran V <Kiran.Varaganti@amd.com>
Date:   Thu Oct 4 16:32:21 2018 +0530

    This is a fix to floating-point exception error for BLIS SGEMM with larger matrix sizes.
    BUG No: CPUPL-197 fixed by Thangaraj Santanu
    The bli_clock_min_diff() function in BLIS assumed that if the time taken is greater than 1 hour then the reading must be wrong. However, this is not the case in general, while the other checks, such as the time taken being close to zero or on the order of a nanosecond, are of course still valid.
    gerrit review: http://git.amd.com:8080/#/c/118694/1/frame/base/bli_clock.c
    
    Change-Id: I9dc313d7c5fdc20684f67a516bf3237de3e0694a

commit 8bf30eb4735872388b5317883d99b775a344ce25
Author: Devangi N. Parikh <dnp@cs.utexas.edu>
Date:   Wed Oct 3 22:22:29 2018 -0400

    Fixed runme.sh in test/studies/thunderx2
    
    Details:
    - Fixed the setting of threads for a single core run.

commit f6f2456ba2afa8f85f43c7c2c90acc439d61d94f
Author: Devangi N. Parikh <dnp@cs.utexas.edu>
Date:   Wed Oct 3 21:43:46 2018 -0400

    Fixed the Makefile in test/studies/thunderx2
    
    Details:
    - Fixed target for make-all-st and make-all-mt so that the armpl
      targets are built

commit 743a1a6dec1bd3908f0f15513b501c9bd59715b3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 3 14:40:10 2018 -0500

    Fixed misleading version query from gcc 7+.
    
    Details:
    - gcc 7 introduced new behavior to the -dumpversion option whereby only
      the major version component is output. However, as part of this
      change, gcc 7 also introduced a new option, -dumpfullversion, which is
      guaranteed to always output the major, minor, and revision numbers. If
      we are using gcc 7 or later, we re-query the version string with this
      new option and then re-parse the result so as to avoid misleading
      output from configure (e.g. gcc 7.3.0 being reported as 7.7.7).
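
    Illustration (configure itself is a shell script; this is only a C
    sketch of the idea): a parser can detect that a version string carries
    fewer than three components, which is the cue to re-query the compiler
    with -dumpfullversion. The version_t type and parse_version() helper
    are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for a parsed compiler version. */
typedef struct { int major, minor, revision; } version_t;

/* Parses "major.minor.revision" from s. Returns sscanf's count of
   converted components: 3 for a full version string such as "7.3.0",
   1 when only the major number is present (as in "7"). Components that
   are not present are left at zero. */
static int parse_version(const char *s, version_t *v)
{
    assert(s != NULL && v != NULL);
    v->major = v->minor = v->revision = 0;
    return sscanf(s, "%d.%d.%d", &v->major, &v->minor, &v->revision);
}
```

    A caller that gets back fewer than 3 knows the string is incomplete
    and should not reuse the major number for the missing fields.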

commit de07840ba5672b9d7b2ed2b918974e98c3f249fb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 3 13:57:25 2018 -0500

    Whitespace, https updates to README.md.
    
    Details:
    - Reformatted to fit all lines within 80 columns, unless a link is too
      long to fit on a single line.
    - Changed some links from http to https.

commit 80a8b3dd8034ec8bc03d31be3f9c837c3f6fc94b
Author: sraut <Biplab.Raut@amd.com>
Date:   Wed Oct 3 15:30:33 2018 +0530

    Review comments incorporated for small TRSM.
    
    Change-Id: Ia64b7b2c0375cc501c2cb0be8a1af93111808cd9

commit b8dfd82e0d1afda4ee5436662d63515a59b2dee3
Author: Devin Matthews <damatthews@smu.edu>
Date:   Tue Oct 2 15:37:12 2018 -0500

    Get pthreads via blis.h in the test driver.

commit d0c0c20b7bd3ecf914b5910a50f618fb7d7aa355
Author: Devin Matthews <damatthews@smu.edu>
Date:   Tue Oct 2 15:16:00 2018 -0500

    There seems to be a problem with _POSIX_BARRIERS on Travis.

commit 0904d9e4df0c8a256ac35c491f14a587ebe9fca2
Author: Devin Matthews <damatthews@smu.edu>
Date:   Tue Oct 2 15:04:36 2018 -0500

    *Always* use Windows primitives instead of pthreads.

commit 998317d309934cd7129f8c818ea6e5f07534ebc8
Author: Devin Matthews <damatthews@smu.edu>
Date:   Tue Oct 2 14:43:24 2018 -0500

    Remove pthreads from appveyor build.

commit 627d0c5bfd4b7b149803587391c93b164c11ced5
Author: Devin Matthews <damatthews@smu.edu>
Date:   Tue Oct 2 14:40:55 2018 -0500

    Combine the alternative barrier implementation for macOS with the pthread wrapper for Windows. Also implement pthread_{create,join} for Windows.

commit 81d2c064a209df7eca7d6103696ca3a137a7f82e
Author: Devin Matthews <damatthews@smu.edu>
Date:   Tue Oct 2 11:46:36 2018 -0500

    Add wrapper for basic pthreads functionality (mutex, once) with MSVC.

commit d33f130ea621fca1dccb30631f454d237918eb04
Author: Devin Matthews <damatthews@smu.edu>
Date:   Tue Oct 2 11:45:43 2018 -0500

    Some configure changes:
    
    1) Allow environment variables to be set anywhere in the argument list.
    2) Allow any environment variable to be set.
    3) Allow LIBPTHREAD to be set to null without getting defaulted to -lpthread.

commit 9d5f1c4f3bf70c2c0ea84bfa326a0113ae2d176c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 1 17:39:26 2018 -0500

    Patch to avoid gcc warning in blastest/f2c/open.c.
    
    Details:
    - Use the modulo operator to limit the size of an integer that is given
      to sprintf(). This avoids a warning in some versions of gcc about the
      integer potentially overflowing the available space in the string into
      which the integer is being printed.
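
    Illustration (not the actual open.c code): bounding an integer with the
    modulo operator before handing it to sprintf() caps the number of
    digits that can be written. The make_unit_name() helper, the "fort."
    prefix, and the modulus of 10000 are assumptions chosen for this
    sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes "fort.<n>" into buf, where <n> has at most four digits.
   Needs 5 bytes for "fort.", up to 4 digits, and 1 for the NUL. */
static void make_unit_name(char *buf, size_t bufsize, long unit)
{
    assert(buf != NULL && bufsize >= 10);
    assert(unit >= 0);

    /* unit % 10000 lies in 0..9999, so the output length is bounded. */
    sprintf(buf, "fort.%ld", unit % 10000);

    assert(strlen(buf) < bufsize);
}
```

    For example, a unit number of 123456 produces "fort.3456". Where
    truncating the value is not acceptable, snprintf() with the buffer
    size is the usual alternative.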

commit 0c3cd00ba76de607e807f8deb04b1a2ce18ea7a8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 1 16:18:25 2018 -0500

    More README.md updates.
    
    Details:
    - Replaced much of "Getting Started" section with a shortened version of
      the bullet list of documentation currently shown in the github wiki
      page. Thanks to Devangi Parikh for her feedback in this change.

commit 8eaf34bd23b30a1857a50d7142ee9811895f24bf
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 1 14:29:07 2018 -0500

    Very minor README.md update.

commit 599090e0eb41b2706fa1231fa7b90096f3281678
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 1 14:04:30 2018 -0500

    README.md update.
    
    Details:
    - Added language mentioning SHPC group to Introduction.

commit ee46fa3efb6e920fa6c3d0b0601007f5de31deb5
Author: sraut <Biplab.Raut@amd.com>
Date:   Mon Oct 1 16:30:30 2018 +0530

    Small TRSM optimization changes: 1) single precision small trsm kernels for the XAt=B case are further optimized for performance. 2) double precision small trsm kernels for the AX=B and XAt=B cases are implemented. 3) single precision small trsm kernels for AutX=B are implemented in intrinsics to improve the current performance.
    
    Change-Id: Ic9d67ae6d8522615257dde018903f049dcffa2cf

commit 08045a6c52b6e025652c5b18eb120c0f4e61cf6f
Author: sraut <Biplab.Raut@amd.com>
Date:   Mon Oct 1 15:38:23 2018 +0530

    Corrected the fix made for blastest level-3 failure to check m,n,k non-zero condition in bli_gemm_small.c
    
    Change-Id: Idaf9f2327c3127b04a2738ae8a058b83d6c57934

commit ac18949a4b9613741b9ea8e5026d8083acef6fe4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Sep 30 18:54:56 2018 -0500

    Multithreading optimizations for l3 macrokernels.
    
    Details:
    - Adjusted the method by which micropanels are assigned to threads in
      the 2nd (jr) and 1st (ir) loops around the microkernel to (mostly)
      employ contiguous "slab" partitioning rather than interleaved (round
      robin) partitioning. The new partitioning schemes and related details
      for specific families of operations are listed below:
      - gemm: slab partitioning.
      - herk: slab partitioning for region corresponding to non-triangular
              region of C; round robin partitioning for triangular region.
      - trmm: slab partitioning for region corresponding to non-triangular
              region of B; round robin partitioning for triangular region.
              (NOTE: This affects both left- and right-side macrokernels:
              trmm_ll, trmm_lu, trmm_rl, trmm_ru.)
      - trsm: slab partitioning.
              (NOTE: This affects only left-side macrokernels trsm_ll,
              trsm_lu; right-side macrokernels were not touched.)
      Also note that the previous macrokernels were preserved inside of
      the 'other' directory of each operation family directory (e.g.
      frame/3/gemm/other, frame/3/herk/other, etc).
    - Updated gemm macrokernel in sandbox/ref99 in light of above changes
      and fixed a stale function pointer type in blx_gemm_int.c
      (gemm_voft -> gemm_var_oft).
    - Added standalone test drivers in test/3m4m for herk, trmm, and trsm
      and minor changes to test/3m4m/Makefile.
    - Updated the arguments and definitions of bli_*_get_next_[ab]_upanel()
      and bli_trmm_?_?r_my_iter() macros defined in bli_l3_thrinfo.h.
    - Renamed bli_thread_get_range*() APIs to bli_thread_range*().

commit b952ca8feb6f17f71a4512649c2aa72bdee9c8f4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Sep 28 16:12:32 2018 -0500

    CREDITS file update.

commit 7d96fc437ebaa9dd2d7071865b5df16402fadd64
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Sep 28 15:40:45 2018 -0500

    Allow slashes ('/') in version tags.
    
    Details:
    - Updated the configure script to allow slashes in version string. This
      is needed so that downstream maintainers (such as those for Debian)
      can create local tags such as "upstream/0.4.1". Thanks to M. Zhou for
      reporting this issue via PR #256 and providing me the information
      needed to debug the problem.

commit 5fdddf6f37c64da093c7f59e3a85214e819ae652
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Sep 28 11:25:54 2018 -0500

    Removed 'debian' directory.
    
    Details:
    - Removed the top-level 'debian' directory. This directory is apparently
      no longer needed (issue #257). Thanks to M. Zhou and Nico Schlömer for
      their contributions.

commit 9814cfdf3157ef4726ee604fc895d56e8063d765
Author: Meghana <meghana.vankadari@amd.com>
Date:   Fri Sep 28 11:02:39 2018 +0530

    fixed blastest level-3 failure by adding ((M&N&K) != 0) to check condition in bli_gemm_small.c
    
    Change-Id: I85e4a32996ebb880f3c00bd293edc38f74700fe6

commit 86330953b14c180862deef3ccdcc6431259be27b
Merge: 7af5283d 807a6548
Author: praveeng <praveen.g@amd.com>
Date:   Fri Sep 28 10:08:06 2018 +0530

    Resolved conflicts and modified bli_trsm_small.c
    
    Change-Id: I578d419cff658003e0fdd4c4cdc93145d951ce31

commit 60b2650d7406d266feffe232c2d5692a9e3886d0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 24 15:04:45 2018 -0500

    Added statistics-collecting irun.py script.
    
    Details:
    - Added irun.py script to 'build' directory. This irun.py script is a
      python script for repeatedly invoking a test driver executable, such
      as those found in test/3m4m, and replacing the performance output column
      with four columns that aggregate statistics. Specifically, the script
      reports the minimum, average, maximum, and standard deviation for each
      problem size. This script is useful especially (though not
      exclusively) when trying to determine the impact of relatively minor
      changes to the code, or other small optimizations that may be
      difficult to distinguish from "noise." One way this "noise" manifests
      is that a test executable may run slightly slower or faster for all
      problem sizes (and all implementations) tested by the executable over
      the life of a single execution. The cause of these minor
      across-the-board perturbations in the overall performance signatures is
      unknown, though we hypothesize that it may relate to any number of
      issues such as operating system scheduling, where in memory the
      program is loaded, or how the CPU clock frequency is throttled at the
      time of execution. Regardless of the source of these subtle
      performance anomalies, the statistical properties reported by the
      irun.py script help the user to more precisely characterize the
      underlying performance exhibited by any given test driver, which
      allows him or her to make better judgments about the true difference
      in performance between two implementations, or minor changes within a
      single implementation.

commit 807a654888117fb3a27ea36384f1c1c11b882cd5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Sep 20 15:41:05 2018 -0500

    Fixed confusing configure message for libmemkind.
    
    Details:
    - Corrected feedback echoed to user by configure when libmemkind is
      found but not explicitly requested. In these cases, configure would
      echo a message that it had received an explicit request to enable
      libmemkind, which was not accurate, even if the end result was the
      same--that libmemkind is enabled by default when it is found. Thanks
      to Devangi Parikh for reporting this issue.

commit 02adab427c779b0aaf38a5877a5f0246b1909e8f
Author: Devangi N. Parikh <dnp@cs.utexas.edu>
Date:   Thu Sep 20 14:38:50 2018 -0400

    Created a 'thunderx2' subdirectory within test/studies
    
    Details:
    - Created a 'thunderx2' subdirectory within test/studies to house
      various level-3 test drivers used to measure performance on
      ThunderX2.

commit d7537fb51dac0636591fc7c68261a2322642ab3c
Merge: dad07245 c03728f1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Sep 12 15:24:20 2018 -0500

    Merge branch 'dev'

commit dad07245dbcfaf35232ec379ba756eb133c361c1
Author: Devangi N. Parikh <dnp@cs.utexas.edu>
Date:   Wed Sep 12 04:16:58 2018 -0500

    Fixed yet another bug in runme script in test/studies
    
    Details:
    - Fixed another copy-paste bug

commit e669057fe35f2037d8111af687d84a0ecf6d7a2a
Author: Devangi N. Parikh <dnp@cs.utexas.edu>
Date:   Tue Sep 11 22:29:42 2018 -0500

    Fixed bug in runme script in test/studies
    
    Details:
    - Fixed bug in runme script for skx studies that set the number of
      threads incorrectly

commit 232fdc3df3e01ae3f86d53767bd14eb93b511e6e
Author: Devangi N. Parikh <dnp@cs.utexas.edu>
Date:   Mon Sep 10 18:45:50 2018 -0500

    Updated runme script in test/studies.
    
    Details:
    - Updated runme script for skx studies to run multithreading tests
      on 1 and 2 sockets.

commit c03728f1f45edb5e434db90ab8a77ba0184a682b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 10 17:54:27 2018 -0500

    Various minor cleanups.
    
    Details:
    - Rewrote bli_winsys.c to define bli_setenv() and bli_sleep()
      unconditionally, with separate Windows and non-Windows
      implementations, but then disabled the definition of bli_setenv()
      entirely since BLIS no longer needs to set environment variables.
      Updated bli_winsys.h accordingly, and call bli_sleep() from within
      testsuite instead of sleep() directly.
    - Use
        #if !defined(_POSIX_BARRIERS) || (_POSIX_BARRIERS != 200809L)
      instead of
        #if !defined(_POSIX_BARRIERS) || (_POSIX_BARRIERS < 0)
      when guarding against local definition of pthread barrier in
      testsuite. (The description for unistd.h implies that _POSIX_BARRIERS
      should always be set to 200809L when barriers are supported, though I
      won't be surprised if we encounter a case in the future where it is
      set to something else such as 1 while still supported.)
    - Removed old _VERS_CONF_INST definitions and installation rules in
      top-level Makefile. These are no longer needed because we no longer
      output libraries with the version and configuration name as
      substrings.
    - Comment/whitespace updates in Makefile, config.mk.in, common.mk,
      configure, bli_extern_defs.h, and test_libblis.h.
    - Added mention of 1m to README.md and other trivial tweaks.

commit e249a00a82908054ecd307cf602c8801275903e8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 10 16:48:35 2018 -0500

    Imported skx dgemm ukernel from skx-redux branch.
    
    Details:
    - Added the new bli_dgemm_skx_asm_16x14.c microkernel from the skx-redux
      branch, along with appropriate blocksizes in bli_cntx_init_skx.c and
      a prototype in bli_kernels_skx.h. (Devin has not yet written the
      sgemm analogue, so for now we will continue using the older sgemm
      ukernel.)
    - Updated frame/include/bli_x86_asm_macros.h with a minor change that
      was present within the skx-redux branch.

commit e93b01ff60bf9742baa5eefd93e208d1219e7a43
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Sun Sep 9 15:57:43 2018 -0500

    Windows DLL support (#246)
    
    * Enable shared
    
    * Enable rdp
    
    * Add support for dll
    
    * Use libblis-symbols.def
    
    * Fix building dlls
    
    * Fix libblis-symbols.def
    
    * Fix soname
    
    * Fix Makefile error
    
    * Fix install target
    
    * Fix missing symbols
    
    * Add BLIS_MINUS_TWO
    
    * Add path to dll
    
    * Fix OSX soname
    
    * Add declspec for dll
    
    * Add -DBLIS_BUILD_DLL
    
    * Replace @enable_shared@ in config
    
    * switch to auto for now
    
    * blis_ -> bli_
    
    * Remove BLIS_BUILD_DLL in make check
    
    * change auto->haswell
    
    * enable_shared_01
    
    * Add wno-macro-redefined
    
    * print out.cblat3
    
    * BLIS_BUILD_DLL -> BLIS_IS_BUILDING_LIBRARY
    
    * Use V=1
    
    * Remove fpic for windows
    
    * Remember LIBPTHREAD
    
    * Remove libm for windows
    
    * Remember AR
    
    * Fix remembering libpthread
    
    * Add Wno-maybe-uninitialized in only gcc
    
    * Don't do blastest for shared for now
    
    * Fix install target
    
    And remove unnecessary change
    
    * test auto and x86_64
    
    * Fix install target again
    
    * Use IS_WIN variable
    
    * Remove leading dot from LIBBLIS_SO_MAJ_EXT
    
    * Make is_win yes/no
    
    * Add comments for windows builds
    
    * Change if else blocks location

commit 1330d5c4bc3b644ec0af54c3939a5b9f00eacd9c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Sep 7 19:37:59 2018 -0500

    Employ "user" cflags for tl Makefile test targets.
    
    Details:
    - Use get-user-cflags-for() to generate cflags when compiling BLAS test
      drivers and BLIS testsuite from top-level Makefile. Meant to include
      these changes in previous commit (4b5437e). Thanks to Isuru Fernando
      for pointing out this oversight.

commit 4b5437ec7afb2befffffbb83f7872bcb4fc61e51
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Sep 7 17:24:32 2018 -0500

    Define a cpp macro specific to BLIS compilation.
    
    Details:
    - Tweaked the cflags functions in common.mk so that a new preprocessor
      macro, BLIS_IS_BUILDING_LIBRARY, is defined, but only when BLIS
      itself is being built. This macro will not be defined when, for
      example, the testsuite or example code compiles code local to those
      applications. This was done in part by defining a new cflags function
      get-user-cflags-for(), which is now the designated function for
      application Makefiles if they wish to inherit a basic set of CFLAGS
      from BLIS. (The compiler flags returned are identical to that of
      get-frame-cflags-for() except that -DBLIS_IS_BUILDING_LIBRARY is
      omitted.)
    - Updated all test driver-like makefiles to call get-user-cflags-for()
      instead of get-frame-cflags-for().

commit cc2cca4f56eb30212a0dce3e5c121e64d9e59560
Merge: e19e7212 fb81c7fc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Sep 6 17:12:13 2018 -0500

    Merge branch 'dev'

commit e19e7212872da3d464734199193436faa51f0da0
Merge: 97965b09 b3d0702c
Author: Jeff Hammond <jeff.science@gmail.com>
Date:   Thu Sep 6 14:58:49 2018 -0700

    Merge pull request #244 from kali/pthread-barrier-osx
    
    add an adhoc impl for pthread_barrier

commit b3d0702cf2ef6dda19a23dd8a677be1b6f73c322
Merge: 4e7d0670 97965b09
Author: Jeff Hammond <jeff.science@gmail.com>
Date:   Thu Sep 6 14:58:23 2018 -0700

    Merge branch 'master' into pthread-barrier-osx

commit 4e7d06700f176a62952d7d51e41fdcbc6b7a9d5f
Author: Mathieu Poumeyrol <kali@zoy.org>
Date:   Thu Sep 6 23:48:31 2018 +0200

    second __APPLE__

commit fb81c7fc665d68e6a2add163feb29acc0bce8936
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Sep 6 16:29:39 2018 -0500

    Defined cortexa53 sub-configuration.
    
    Details:
    - Added a new sub-configuration 'cortexa53', which is a mirror image
      of cortexa57 except that it will use slightly different compiler
      flags. Thanks to Mathieu Poumeyrol for making this suggestion after
      discovering that the compiler flags being used by cortexa57 were
      not working properly in certain OS X environments (the fix to which
      is currently pending in pull request #245).

commit 24ecc0d94aaa9ab4df1ae6d199c4ec6d7783169f
Author: Mathieu Poumeyrol <kali@zoy.org>
Date:   Thu Sep 6 22:10:16 2018 +0200

    use _POSIX_BARRIERS instead of __APPLE__

commit 97965b09059a610db06fb7a22bdfa79c0d37d673
Author: Mathieu Poumeyrol <kali@users.noreply.github.com>
Date:   Thu Sep 6 21:10:29 2018 +0200

    cortexa9 and cortexa53 travis build + qemu test (#245)

commit a6802eab7d94b5a9de633c53beca8245b74f5dc6
Author: Mathieu Poumeyrol <kali@zoy.org>
Date:   Thu Sep 6 17:16:35 2018 +0200

    reinstantiate test on macos

commit d688a2b7e5a19cba44ea398a99e325e19b8fce50
Author: Mathieu Poumeyrol <kali@zoy.org>
Date:   Thu Sep 6 15:25:16 2018 +0200

    add an adhoc impl for pthread_barrier

commit ab9f9e684dc3ffbb70cc45b21c67af5d916919e5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 30 15:14:02 2018 -0500

    CHANGELOG update (0.4.1)

commit 10fd614031307c46db3d893528d4e5fc31f490b3 (tag: 0.4.1)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 30 15:13:59 2018 -0500

    Version file update (0.4.1)

commit 08dd67c4b21244851f8416bd59159bea7a9c5b3d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 30 15:12:13 2018 -0500

    ReleaseNotes.md update in advance of next version.

commit 4fa4cb0734e7de6505b5d6f1aeef3a5d5c89dcbb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 29 18:06:41 2018 -0500

    Trivial comment header updates.
    
    Details:
    - Removed four trailing spaces after "BLIS" that occur in most files'
      commented-out license headers.
    - Added UT copyright lines to some files. (These files previously had
      only AMD copyright lines but were contributed to by both UT and AMD.)
    - In some files' copyright lines, expanded 'The University of Texas' to
      'The University of Texas at Austin'.
    - Fixed various typos/misspellings in some license headers.

commit b051ffb815baf6c3ece2b5118b679fd9219d5780
Merge: 6f33d9de aaa549f4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 29 17:06:48 2018 -0500

    Merge branch 'dev'

commit 6f33d9de21fbc2f579846b9104fb9d513753f79c
Author: Mathieu Poumeyrol <kali@users.noreply.github.com>
Date:   Wed Aug 29 23:48:22 2018 +0200

    fix compilation of armv7a kernels (#242)

commit 8199e339aefdd27019c7f3d8c99818d375d5400b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Aug 27 07:00:12 2018 -0500

    Added testsuite threading to input.general.fast.
    
    Details:
    - Added lines associated with the testsuite's new threading option to
      input.general.fast. This change was intended for the previous commit
      (10d0735).

commit 10d07357afbb2d468837aa97369ef9a6d0610817
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Aug 26 20:34:30 2018 -0500

    Better thread safety; added threading to testsuite.
    
    Details:
    - Replaced critical sections that were conditional upon multithreading
      being enabled (via pthreads or OpenMP) with unconditional use of
      pthreads mutexes. (Why pthreads? Because BLIS already requires it
      for its initialization mechanism: pthread_once().) This was done in
      bli_error.c, bli_gks.c, bli_l3_ind.c. Also, replaced usage of BLIS's
      mtx_t object and bli_mutex_*() API with pthread mutexes in
      bli_thread.c. The previous status quo could result in a race condition
      if the application called BLIS from more than one thread. The new
      pthread-based code should be completely agnostic to the application's
      threading configuration. Thanks to AMD for bringing to our attention
      the need for a thread-safety review.
    - Added an option to the testsuite to simulate application-level
      multithreading. Specifically, each thread maintains a counter that is
      incremented after each experiment. The thread only executes the
      experiment if: counter % n_threads == thread_id. In other words, the
      threads simply take turns executing each problem experiment. Also,
      POSIX guarantees that fprintf() will not intermingle output, so
      output was switched to fprintf() instead of libblis_test_fprintf().
    - Changed membrk_t objects to use pthread_mutex_t instead of mtx_t and
      replaced use of bli_mutex_init()/_finalize() in bli_membrk.c with
      wrappers to pthread_mutex_init()/_destroy().
    - Changed the implementation of bli_l3_ind_oper_enable_only() to fix
      a race condition; specifically, two threads calling the function with
      the same parameters could lead to a non-deterministic outcome.
    - Added #include <pthread.h> to bli_cpuid.c and moved the same in
      bli_arch.c.
    - Added 'const' to declaration of OPT_MARKER in bli_getopt.c.
    - Added #include <pthread.h> to bli_system.h.
    - Added add-copyright.py script to automate adding new copyright lines
      to (and updating existing lines of) source files.

commit aaa549f4d1e63929fe2bea023ce849253cfbbb42
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Aug 26 20:13:51 2018 -0500

    Minor update to configure --help (--sharedir option).
    
    Details:
    - Fixed/tweaked description for --sharedir=SHAREDIR option.

commit 573b8ac373f821a65cc8afd51cdbe03b8ec01081
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Aug 26 13:51:32 2018 -0500

    Fixed copy-paste typo in previous commit.
    
    Details:
    - Fixed a typo in travis/do_testsuite.sh introduced in 62ea1d3.

commit 62ea1d33d3bc1e890420a1e828b9d0e87e87533b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Aug 26 13:35:53 2018 -0500

    Fixed broken out-of-tree builds.
    
    Details:
    - Fixed stale filepaths to check-blastest.sh and check-blistest.sh in
      travis/do_testsuite.sh and travis/do_sde.sh.
    - Create a symbolic link to the 'config' directory so that the top-level
      Makefile can find the configs' make_defs.mk files during out-of-tree
      builds.
    - Added additional case handling to out-of-tree scenario to handle
      situations where files 'Makefile', 'common.mk', or 'config' exist but
      are not symbolic links. In such cases, configure warns the user and
      exits.
    - Homogenized various error messages throughout configure.
    - Belated thanks to Victor Eijkhout for requesting the feature added
      in 0f491e9 whereby lesser Makefiles can compile and link against
      an existing installation of BLIS.

commit 0f491e994a7e14d4dfce26e6a51dba2bccad29a3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Aug 25 20:12:36 2018 -0500

    Allow lesser Makefiles to reference installed BLIS.
    
    Details:
    - Updated the build system so that "lesser" Makefiles, such as those
      belonging to example code or the testsuite, may be run even if the
      directory is orphaned from the original build tree. This allows a
      user to configure, compile, and install BLIS, delete the build tree
      (that is, the source distribution, or the build directory for out-
      of-tree builds) and then compile example or testsuite code and link
      against the installed copy of BLIS (provided the example or testsuite
      directory was preserved or obtained from another source). The only
      requirement is that make be invoked while setting the
      BLIS_INSTALL_PATH variable to the same installation prefix used when
      BLIS was configured. The easiest syntax is:
    
        make BLIS_INSTALL_PATH=/install/prefix
    
      though it's also permissible to set BLIS_INSTALL_PATH as an
      environment variable prior to running 'make'.
    - Updated all lesser Makefiles to implement the new aforementioned build
      behavior.
    - Relocated check-blastest.sh and check-blistest.sh from build to
      blastest and testsuite, respectively, so that if those directories are
      copied elsewhere the user can still run 'make check' locally.
    - Updated docs/Testsuite.md with language that mentions this new option
      of building/linking against an installed copy of BLIS.

commit 36ff92ce0d3b428b15b6cddc6f5944afe22e43ec
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Aug 24 18:26:09 2018 -0500

    Missing C++ compiler no longer fatal to configure.
    
    Details:
    - Changed configure so that the absence of any C++ compiler from the
      pre-defined search list does not result in an exit. Instead, in this
      situation, the found_cxx variable is assigned 'c++notfound' and the
      error message is changed to remind the user that C++ will not be
      available in the sandbox. Thanks to Devangi Parikh for reporting this
      issue.
    - Also tweaked the message when a C++ compiler *is* found to remind any
      would-be confused user that BLIS will only use C++ if it is needed by
      code in the sandbox.

commit 658f0a129bdc565b072696b6ebddce501132091c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Aug 24 17:49:37 2018 -0500

    Fixed obscure integer size bug in va_arg() usage.
    
    Details:
    - Fixed a bug in the way that the variadic bli_cntx_set_l3_nat_ukrs()
      function was defined. This function is meant to take a microkernel id,
      microkernel datatype, microkernel address, and microkernel preference
      as arguments, and is typically called within the bli_cntx_init_*()
      function defined within a sub-configuration for initializing an
      appropriate context. The problem is with the final argument: the
      microkernel preference. These preferences are actually boolean values,
      0 or 1 (encoded as FALSE or TRUE). Since the variadic function does
      not give the compiler any type information for any variadic arguments,
      they are "promoted" in the course of internal (macroized) processing
      according to default argument promotion rules. Thus, integer literals
      such as 0 and 1 become int and floating-point literals (such as 0.0 or
      1.0) become double. Previous to this commit, we indicated to va_arg()
      that the ukernel preference was a 'bool_t', which is a typedef of
      int64_t on 64-bit systems. On systems where int is defined as 64 bits,
      no problems manifest since int is the same size as the type we passed
      in to va_arg(), but on systems where int is 32 bits, the ukernel
      preference could be misinterpreted as a garbage value. (This was
      observed on a modern armv8 system.) The fix was to interpret the
      bool_t value as int and then immediately typecast it to and store it
      as a bool_t. Special thanks to Devangi Parikh for helping track down
      this issue, including deciphering the use of va_arg() and its
      byzantine treatment of types.
    - Added explicit typecasts for all invocations of va_arg() in
      bli_cntx.c.

commit e71dc389120b032e42091e4d1a928515ed6f7275
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Aug 24 15:56:04 2018 -0500

    Fixed a very minor memory leak in gks.
    
    Details:
    - Fixed a memory leak in the global kernel structure that resulted in 56
      bytes per configured architecture (of which only 18 are presently
      supported by BLIS). The leak would only manifest if BLIS was
      initialized and then finalized before the application terminated.
      Thanks to Devangi Parikh for helping track down this leak.

commit a7e3a5f9753468c8e665e6c5c3b38d22b7c92500
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Aug 24 14:51:11 2018 -0500

    Fixed uncallable bli_finalize().
    
    Details:
    - Previously, bli_finalize_once()--which, like bli_init_once(), was
      implemented in terms of pthread_once()--was using the same
      pthread_once_t control object being used by bli_init(), thus
      guaranteeing that it would never be called as long as BLIS had already
      been initialized. This could manifest as a rather large memory leak to
      any application that attempted to finalize BLIS midway through its
      execution (since BLIS reserves several megabytes of storage for
      packing buffers per thread used). The fix entailed giving each
      function its own pthread_once_t object. Thanks to Devangi Parikh for
      helping track down this very quiet bug.

commit a79c21c7c17fb4854fd24c73b81ec5543f74082d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 23 14:40:46 2018 -0500

    Fixed cleanmk target post-1b0f8d6.
    
    Details:
    - Changed the cleanmk target to delete makefile fragments from their new
      home in obj/$(CONFIG_NAME). The old definition worked only because of
      a typo (REFERKN_PATH instead of REFKERN_PATH), and only in the
      non-verbose (V != 1) case.

commit ffb57242f3eb1175c991fe1b492595fdaa175c27
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 22 18:22:41 2018 -0500

    Cosmetic output changes to configure.
    
    Details:
    - Disable sandbox-related obj directory creation, directory mirroring,
      and makefile fragment generation when a sandbox is not enabled.
    - Prevent various duplicate actions by configure (such as those
      mentioned above for sandboxes).

commit ac17454aae9ad430f05aa7c156919c6c695c300c
Merge: a77bec76 7afd095a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 22 15:34:53 2018 -0500

    Merge branch 'master' into dev

commit a77bec766a01e42f13f8cacbec8c4cbde8ecefef
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 22 15:31:29 2018 -0500

    Whitespace changes, minor renames in build system.
    
    Details:
    - Minor whitespace cleanup, mostly in the form of spaces -> tabs.
    - Shortened certain variables' _FRAGMENT_ infixes to _FRAG_ in
      common.mk.

commit 1b0f8d60d1132b56485cc202ebf1246898d3a2a4
Author: Devin Matthews <damatthews@smu.edu>
Date:   Wed Aug 22 13:19:29 2018 -0700

    Generate makefile fragments in build tree (#240)
    
    * Make src dir read-only in out-of-tree build test.
    
    * Generate makefile fragments in the build tree.

commit 7afd095af33690e0175903852b354c9fe46993f6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 22 14:58:24 2018 -0500

    Removed skx from code snippet in previous commit.
    
    Details:
    - The docs/ConfigurationHowTo.md document was written with examples that
      did not yet contain the skx sub-configuration, but the previous commit
      included bli_arch.c code copied and pasted from a recent commit that
      does support skx. To keep things consistent, I've removed skx from the
      recently-added ConfigurationHowTo.md code snippet.

commit 48211a980d78673133076e8eced1007b1980f5e6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 22 14:55:02 2018 -0500

    Update to docs/ConfigurationHowTo.md.
    
    Details:
    - Added missing language directing the reader to modify the config_name
      string array in bli_arch.c when adding a new sub-configuration. Thanks
      to Devangi Parikh for reporting this missing section.

commit 65c9096c6e21f3dc2947fa12be9ea3034f8662dc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Aug 17 11:44:12 2018 -0500

    Fixed broken -p option to configure.
    
    Details:
    - Fixed some stale code that was preventing the -p option to configure
      from working as expected (though the --prefix option was unaffected).
      This bug was most likely introduced in 7e5648c (May 7 2018).
      Thanks to Dave Love for reporting this issue.

commit e358d5e497c77b305af462f44266370a596445e2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 16 12:18:45 2018 -0500

    README.md update (Funding section).

commit a61dd5e7bcf23f7237d407a5e06dd44e1bec9ad0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 14 17:08:03 2018 -0500

    Changed 'test' target to be more like 'check'.
    
    Details:
    - Redefined the 'test' make target in the top-level Makefile so that the
      final result ("everything passed" or "at least one failure") is echoed
      to stdout. Note that 'check' is unchanged, and thus is now effectively
      a fast version of 'test'.
    - Updated docs/BuildSystem.md to reflect the above change.

commit ce5c3a198a7ae1ca676c27da4541d51ed19d16e1
Merge: 4f6745d6 0bbe69d5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 14 16:52:19 2018 -0500

    Merge branch 'master' of github.com:flame/blis

commit 4f6745d68a2c66511695eff0beb00a82ffc6bbbe
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 14 16:50:47 2018 -0500

    Fixed link error when building only shared library.
    
    Details:
    - Fixed a linker error that occurred when attempting to compile and link
      the testsuite and/or BLAS test drivers after having configured BLIS to
      only generate a shared library (no static library). The chosen
      solution involved
      (1) adding the local library path, $(BASE_LIB_PATH), to the search
          paths for the shared library via the link option
          -Wl,-rpath,$(BASE_LIB_PATH).
      (2) adding a local symlink to $(BASE_LIB_PATH) that uses the .so major
          version number so that ld would find the shared library at
          execution time.
      Thanks to Sajid Ali for reporting this issue, to Devin Matthews for
      pointing out the need for the -rpath option, and to Devangi Parikh for
      helping Sajid isolate the problem.
    - Added #include <ctype.h> to bli_system.h to avoid a compiler warning
      resulting from using toupper() from bli_string.c without a prototype.
      Thanks again to Sajid Ali, whose build log revealed this compiler
      warning.
    - Added '*.so.*' to .gitignore.
    - CREDITS file update.

commit 0bbe69d5ed260849297d8f2d35b7668d167482ed
Author: Devangi N. Parikh <dnp@cs.utexas.edu>
Date:   Tue Aug 14 14:49:58 2018 -0500

    Updated plotting scripts in test/studies.
    
    Details:
    - Fixed indexing on plots to correspond to the removal of dtime in
      the test drivers.

commit e93e0e149e087e08eca2885f1a748a4e88ffe55d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 7 15:54:30 2018 -0500

    Removed redefinition of axpyv, scal2v func types.
    
    Details:
    - Removed a stray/accidental redefinition of axpyv and scal2v function
      types in frame/1d/bli_l1d_ft.h (probably a copy/paste leftover during
      development).

commit 1deb33bd16349aaa643694d1bd685ff8a9a5f476
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 7 15:02:50 2018 -0500

    Updated penryn kernels to use new _ker_ft type names.
    
    Details:
    - Updated older _ft kernel type suffixes used within penryn level-1v
      and -1f kernels to use the newer _ker_ft suffix that was introduced
      in 0175483. (Thank you Travis CI.)

commit 9cb0b023ca91abdc056d726cdc070062e4954611
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 7 14:21:07 2018 -0500

    INSTALL file update.

commit 017548314f3f78f66fbe3264509ac5302bd8d62b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 7 14:13:25 2018 -0500

    Replaced function chooser macros w/ func ptr arrays.
    
    Details:
    - Previously, most object API functions (_oapi.c) used a function
      chooser macro that would expand out to an if-elseif-elseif-else
      conditional that used a num_t datatype to call the appropriate
      type-specific API (_tapi.c). This always felt a little hackish, and
      would get in the way somewhat of adding support for new num_t datatypes
      in the future. So, I've replaced that functionality with code that
      queries a function pointer that is then typecast appropriately. This
      model of function calling was already pervasive for kernels queried
      from the cntx_t structure. It was also already in use in various other
      functions, such as macrokernels, and this commit simply extends that
      pattern.
    - The above change required many new files, mostly header files, that
      define the function types (mostly _ft.h) for the queriable functions
      as well as some source files to define the function pointer arrays and
      their corresponding query functions (_fpa.c). Various other function
      types, mostly for kernel function types, were renamed to reduce the
      potential for confusion with the function types for expert and basic
      (non-expert) typed API functions.
    - Removed definitions for all of the "bli_call_ft_*()" function chooser
      macros from bli_misc_macro_defs.h.

commit addce089664561f9f63efa6f107e58fc48d29871
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Aug 6 13:18:20 2018 -0500

    Format spec and other updates in test, test/3m4m.
    
    Details:
    - Removed the dtime (delta time, or wallclock time) column from the
      matlab output of all test drivers in test, test/3m4m, test/studies.
      This value was rarely (if ever) really needed and usually only served
      to take up screen space.
    - Updated format specifier in test/studies/skx to use %7.2f instead of
      %6.3f.
    - For the test drivers in 'test' directory, added an initial line of
      output that sets last entry of matlab matrix to zero in order to
      induce a pre-allocation of the entire array of performance results.

commit 94d5ef42c833a4d43e50a80d46dddbd7a56d2db6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Aug 4 15:57:17 2018 -0500

    Adjusted gflops format spec in testsuite, test/3m4m.
    
    Details:
    - Changed the format specifier for the gflops column in the testsuite
      output from %7.3f to %7.2f. This was done mainly to keep the output
      aligned properly when the expected performance exceeded 1000 gflops.
      Also, two decimal places still conveys plenty of precision for all
      practical applications, including just eyeballing performance deltas
      between two executions (let alone two implementations).
    - Changed the format specifier for gflops in the test/3m4m drivers
      from %6.3f to %7.2f (for the same reasons listed above).

commit c7ff06bae92b9b6c6656f2030d13486b95417821
Merge: 6074082c ebe998d0
Author: Devangi N. Parikh <dnp@cs.utexas.edu>
Date:   Wed Aug 1 14:20:41 2018 -0500

    Merge branch 'master' of https://github.com/flame/blis

commit 6074082cd359dd775ef72478f8f3a281c5a6a6f9
Author: Devangi N. Parikh <dnp@cs.utexas.edu>
Date:   Wed Aug 1 13:30:51 2018 -0500

    Fixed bug in bli_cntx_set_packm_ker_dt() implementation.
    
    Details:
    - Fixed a bug in the static functions bli_cntx_set_[packm/unpackm]_ker_dt(),
      which were incorrectly calling bli_cntx_get_[packm/unpackm]_ker_dt() to
      get the corresponding func_t.

commit ebe998d06cc56a9a9d66990b6ebf683d6fd0efdf
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 1 13:24:00 2018 -0500

    Fixed typos in BuildSystem.md from previous commit.

commit e72a344e94c5ae253f69b60f41d92ca89a5d1d1c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 1 13:00:38 2018 -0500

    Added table of 'make' targets to BuildSystem.md.
    
    Details:
    - Added a new section to BuildSystem.md that describes the most useful
      make targets defined in the top-level Makefile.

commit 4f60d0288e00586dc921ff57db851f1266ff8e70
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 30 19:22:57 2018 -0500

    README.md, comment updates.
    
    Details:
    - Added links and sandbox language to README.md.
    - Adjusted some comments in high-level level-3 object functions to make
      clear what bli_thread_init_rntm() does.

commit 455d3f49e5c8362395be14c79e6adb5123e29623
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Jul 29 18:31:29 2018 -0500

    Edits to object/typed API, multithreading docs.

commit 922a1c05e06f52c97fb369870dce07233e61c4c9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jul 28 20:15:55 2018 -0500

    More tweaks to README.md.

commit a7a0cf2b5d9f1dea5061c0f20eeaf371dfd4ea12
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jul 28 16:59:31 2018 -0500

    More edits to docs/Multithreading.md.

commit be21d0cf68c330fd0d2048465a43ddc59d0b9d6c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jul 28 16:46:51 2018 -0500

    Fixed typos in docs/Multithreading.md.

commit eac07c7b4f7a41c68d63f1e67141b2b58009609e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jul 28 16:45:28 2018 -0500

    Edits to docs/Multithreading.md.

commit 5438375a032273b46ae626fee909ffc05f48ab72
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jul 28 16:34:21 2018 -0500

    Fixed link in README.md.

commit 1f1a237d3f0b24d71ce2d7ee52d8a84f8e6a29ad
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jul 28 16:33:28 2018 -0500

    Fixed links in BLISTypedAPI.md.

commit 89c8806e3aa49310f36c0314c5f6956c83a627a1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jul 28 16:30:56 2018 -0500

    Minor doc fixes to previous commit.

commit b8c7574f84873b9c408f70c29c41ce464df57c2d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jul 28 16:27:09 2018 -0500

    README.md, typed/object API updates.
    
    Details:
    - Updated the typed and object APIs to include language on the rntm_t
      parameters in the expert interfaces.
    - Updated README to include link to object API.

commit 29c34c4adb02d91fb34d1ccc0e821d6cfb7ce5c5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 27 16:26:19 2018 -0500

    CREDITS file update.

commit 55a04edf52ac4f16c51b738bc884684adc1f1777
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 27 16:10:46 2018 -0500

    CHANGELOG update (0.4.0)

commit 4ad61ce905d250dd3ef197f0d06a69ce6d99d309 (tag: 0.4.0)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 27 16:10:43 2018 -0500

    Version file update (0.4.0)

commit b86cf13793b07f35c027a56c9faec8f4b6279d3e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 27 16:08:21 2018 -0500

    Release Notes update in advance of next version.

commit a8b4084a0e04e47ac02ceae93a2018f5363e1205
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 27 16:07:26 2018 -0500

    CREDITS file update.

commit 8e10cac5f388ac961c3d77b0a465214e7c9dc91a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 27 14:45:35 2018 -0500

    Updates to CREDITS, RELEASING, config/README.md.
    
    Details:
    - Added individuals' github handles to CREDITS file.
    - Updated RELEASING, config/README.md files.

commit 401b69c8f26a86726ac5e1fb4f9fc2d2098ef204
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 25 17:55:13 2018 -0500

    More indentation in docs/ConfigurationHowTo.md.

commit 1c6a1b921ef96999bb449d657cca6d9a556f7245
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 25 17:14:58 2018 -0500

    Trying new indentation in ConfigurationHowTo.md.
    
    Details:
    - Modified a few sections to take advantage of a feature of markdown
      that allows a bullet or enumeration to have multiple paragraphs. This
      is a trial run to make sure the indentation looks good when rendered
      in a web browser.

commit 71f978719527fcf17617cb234e48bf349a76c12d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 25 15:55:36 2018 -0500

    Whitespace changes to macrokernels' func ptr defs.

commit 87d57c31c2bfcf4609dfe31ce915e9345150e613
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 25 14:20:18 2018 -0500

    Various minor updates to typed, object API docs.

commit fb6e16268aaafbab2fd78d47cbf821e2152261fd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 25 14:17:28 2018 -0500

    Consolidated prototypes in bli_l1v_tapi.h.
    
    Details:
    - Consolidated typed API function prototypes in bli_l1v_tapi.h by
      leveraging identical function signatures between operations.
    - Removed 'restrict' keyword since it is not actually present in the
      function definitions.

commit af60d738f21340ccb0903e6c87dbf6af4fc44fc0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 24 15:35:52 2018 -0500

    Finished object creation part of BLISObjectAPI.md.
    
    Details:
    - Filled in remaining section on object creation function reference
      of BLISObjectAPI.md. All object management functions demonstrated as
      part of the example code in examples/oapi are now documented, as well
      as some other functions that are not shown in the example code.
    - Updated various links (mostly in function index) to correctly point to
      the object API reference instead of the typed API reference.
    - Added documentation to getijm, setijm.

commit 8217a6a3b68382c62f016c658d337e6086112fef
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 24 13:13:10 2018 -0500

    Moved sandbox README.md to docs/Sandboxes.md.
    
    Details:
    - Relocated sandbox/ref99/README.md to docs/Sandboxes.md and made minor
      edits to the document.

commit b7db29332394324ffd1a73c3847a75e9a5b38c8d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 19 11:14:30 2018 -0500

    Explicitly typecast return vals in static funcs.
    
    Details:
    - Added explicit typecasting to various functions (mostly static
      functions), primarily those in bli_param_macro_defs.h,
      bli_obj_macro_defs.h, bli_cntx.h, bli_cntl.h, and a few other header
      files.
    - This change was prompted by feedback from Jacob Gorm Hansen, who
      reported that #including "blis.h" from his application caused
      gcc to output error messages (relating to types being returned
      mismatching the declared return types) when used via the C++ compiler
      front-end. This is the first pass of fixes, and we may need to
      iterate with additional follow-up commits (#233).

commit fa08e5ead95f9d757af6ab5b095a8bf131e3874d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 17 19:02:15 2018 -0500

    Fixed minor issues in ecbebe7 with mt disabled.
    
    Details:
    - Fixed an unused variable warning in frame/base/bli_rntm.c when
      multithreading is disabled.
    - Fixed a missing variable declaration in bli_thread_init_rntm_from_env()
      when multithreading is disabled.

commit ecbebe7c2e43950dfa369f71c2b83cabe348a046
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 17 18:37:32 2018 -0500

    Defined rntm_t to relocate cntx_t.thrloop (#235).
    
    Details:
    - Defined a new struct datatype, rntm_t (runtime), to house the thrloop
      field of the cntx_t (context). The thrloop array holds the number of
      ways of parallelism (thread "splits") to extract per level-3
      algorithmic loop until those values can be used to create a
      corresponding node in the thread control tree (thrinfo_t structure),
      which (for any given level-3 invocation) usually happens by the time
      the macrokernel is called for the first time.
    - Relocating the thrloop from the cntx_t remedies a thread-safety issue
      when invoking level-3 operations from two or more application threads.
      The race condition existed because the cntx_t, a pointer to which is
      usually queried from the global kernel structure (gks), is supposed to
      be read-only. However, the previous code would write to the cntx_t's
      thrloop field *after* it had been queried, thus violating its read-only
      status. In practice, this would not cause a problem when a sequential
      application made a multithreaded call to BLIS, nor when two or more
      application threads used the same parallelization scheme when calling
      BLIS, because in either case all application threads would be using
      the same ways of parallelism for each loop. The true effects of the
      race condition were limited to situations where two or more application
      threads used *different* parallelization schemes for any given level-3
      call.
    - In remedying the above race condition, the application or calling
      library can now specify the parallelization scheme on a per-call basis.
      All that is required is that the thread encode its request for
      parallelism into the rntm_t struct prior to passing the address of the
      rntm_t to one of the expert interfaces of either the typed or object
      APIs. This allows, for example, one application thread to extract 4-way
      parallelism from a call to gemm while another application thread
      requests 2-way parallelism. Or, two threads could each request 4-way
      parallelism, but from different loops.
    - A rntm_t* parameter has been added to the function signatures of most
      of the level-3 implementation stack (with the most notable exception
      being packm) as well as all level-1v, -1d, -1f, -1m, and -2 expert
      APIs. (A few internal functions gained the rntm_t* parameter even
      though they currently have no use for it, such as bli_l3_packm().)
      This required some internal calls to some of those functions to
      be updated since BLIS was already using those operations internally
      via the expert interfaces. For situations where a rntm_t object is
      not available, such as within packm/unpackm implementations, NULL is
      passed in to the relevant expert interfaces. This is acceptable for
      now since parallelism is not obtained for non-level-3 operations.
    - Revamped how global parallelism is encoded. First, the conventional
      environment variables such as BLIS_NUM_THREADS and BLIS_*_NT are only
      read once, at library initialization. (Thanks to Nathaniel Smith for
      suggesting this to avoid repeated calls to getenv(), which can be slow.)
      Those values are recorded to a global rntm_t object. Public APIs, in
      bli_thread.c, are still available to get/set these values from the
      global rntm_t, though now the "set" functions have additional logic
      to ensure that the values are set in a synchronous manner via a mutex.
      If/when NULL is passed into an expert API (meaning the user opted to
      not provide a custom rntm_t), the values from the global rntm_t are
      copied to a local rntm_t, which is then passed down the function stack.
      Calling a basic API is equivalent to calling the expert APIs with NULL
      for the cntx and rntm parameters, which means the semantic behavior of
      these basic APIs (vis-a-vis multithreading) is unchanged from before.
    - Renamed bli_cntx_set_thrloop_from_env() to bli_rntm_set_ways_for_op()
      and reimplemented, with the function now being able to treat the
      incoming rntm_t in a manner agnostic to its origin--whether it came
      from the application or is an internal copy of the global rntm_t.
    - Removed various global runtime APIs for setting the number of ways of
      parallelism for individual loops (e.g. bli_thread_set_*_nt()) as well
      as the corresponding "get" functions. The new model simplifies these
      interfaces so that one must either set the total number of threads, OR
      set all of the ways of parallelism for each loop simultaneously (in a
      single function call).
    - Updated sandbox/ref99 according to above changes.
    - Rewrote/augmented docs/Multithreading.md to document the three methods
      (and two specific ways within each method) of requesting parallelism
      in BLIS.
    - Removed old, disabled code from bli_l3_thrinfo.c.
    - Whitespace changes to code (e.g. bli_obj.c) and docs/BuildSystem.md.
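
    Illustration (not part of the commit): the NULL-handling rule described
    above can be sketched in a few lines of standalone C. Every name below
    (sketch_rntm_t, sketch_gemm_ex, and so on) is hypothetical and does not
    reflect the actual rntm_t layout or BLIS function signatures. The mutex
    that guards the global "set" functions is omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a runtime object; NOT the real rntm_t. */
typedef struct
{
    int num_threads; /* total number of threads requested */
    int ways[ 5 ];   /* ways of parallelism, one entry per loop */
} sketch_rntm_t;

/* Global defaults, imagined as filled in once at library initialization
   (e.g. from environment variables). */
sketch_rntm_t sketch_global_rntm = { 1, { 1, 1, 1, 1, 1 } };

/* Return a private, per-call copy of the settings to use: the caller's
   object if one was given, otherwise a snapshot of the global defaults. */
sketch_rntm_t sketch_rntm_resolve( const sketch_rntm_t* user )
{
    sketch_rntm_t local = ( user != NULL ) ? *user : sketch_global_rntm;
    assert( local.num_threads > 0 );
    return local;
}

/* "Expert" entry point: accepts an optional runtime object. */
int sketch_gemm_ex( const sketch_rntm_t* rntm )
{
    sketch_rntm_t local = sketch_rntm_resolve( rntm );
    /* A real implementation would pass &local down the call stack; this
       sketch only reports how many threads the call would use. */
    return local.num_threads;
}

/* "Basic" entry point: equivalent to the expert call with NULL. */
int sketch_gemm( void )
{
    return sketch_gemm_ex( NULL );
}
```

    Because each call works on its own local copy, two application threads
    can pass different objects to sketch_gemm_ex() without touching shared
    state, which is the property the per-call scheme relies on.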

commit 323eaaab99752858b12e81e2eb8e416f009a3028
Author: Devangi N. Parikh <dnp@cs.utexas.edu>
Date:   Fri Jul 13 11:40:06 2018 -0500

    Removed left over code from plotting scripts.

commit 60c197736495b47ce974ffb9b43874d1ebcfe78c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 12 19:22:14 2018 -0500

    Documented accessor functions in BLISObjectAPI.md.
    
    Details:
    - Added documentation to docs/BLISObjectAPI.md for a handful of
      commonly-used obj_t accessor functions.
    - Minor updates to docs/BLISTypedAPI.md.

commit 77327ad796e11ef67df0cc91d45ed663598ba4df
Merge: 73b0b2a3 9fef8575
Author: Devangi N. Parikh <dnp@cs.utexas.edu>
Date:   Thu Jul 12 17:09:33 2018 -0500

    Merge branch 'master' of https://github.com/flame/blis

commit 73b0b2a3ac1be6dfbe85c116886b4e29d98ac945
Author: Devangi N. Parikh <dnp@cs.utexas.edu>
Date:   Thu Jul 12 16:53:10 2018 -0500

    Created hardware-specific test driver directory.
    
    Details:
    - Created a 'studies' subdirectory within 'test' to be used to house
      test drivers, makefiles, run scripts, matlab plot code, and related
      files that have been customized for collecting performance data on
      specific host machines or product lines. This new setup will help us
      catalog, track, and share test driver materials over time, and in a
      way that facilitates reproducibility.
    - Created an 'skx' subdirectory within 'test/studies' to house various
      level-3 test driver files used to measure performance on SkylakeX
      nodes (specifically, those nodes used by TACC's stampede2 system).

commit 9fef85756d15ee0f977fff6e57acd01c20cba184
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 11 18:40:30 2018 -0500

    Cleaned up loose ends in BLISObjectAPI.md.
    
    Details:
    - Deleted some lines from the API function signatures that did not
      belong (and were only left over from the copy-paste of the typed API).
    - Fixed some paragraph-in-bullet indentation.

commit 80ddeae4629022b69fdf1f1b053a1fcba643c40c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 11 18:31:57 2018 -0500

    Added BLISObjectAPI.md to docs.
    
    Details:
    - Added first draft of BLISObjectAPI.md. (Object management section is
      still missing.)
    - Small fixes to BLISTypedAPI.md found while writing BLISObjectAPI.md.
    - In various .md files, changed ``` verbatim blocks to language
      attributes (e.g. ```c for C code).

commit 038442add39ce629fee0d960b212ce0c95138d46
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 11 12:24:18 2018 -0500

    Added -lpthread to makefile example in BuildSystem.md.
    
    Details:
    - Added missing pthreads library linking to example makefile in
      docs/BuildSystem.md, as well as similar language to build requirements
      at the beginning of the document. Thanks to Stefanos Mavros for
      bringing this to our attention.
    - Updated CREDITS file.

commit bf10d8624e7b5902c9d9189c7c93f318b8e1b9a5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 9 18:40:13 2018 -0500

    Small updates to KernelsHowTo.md, BLISTypedAPI.md.
    
    Details:
    - Minor updates to BLISTypedAPI.md, mostly to bring terminology
      up-to-date with the new "typed API" classification.
    - Added contents section to KernelsHowTo.md.

commit 1fd3bce59e43b422e62f9684bca9d1296a29edc3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 9 18:20:11 2018 -0500

    Further updates to KernelsHowTo.md, BLISTypedAPI.md.
    
    Details:
    - Added missing level-1v operations to BLISTypedAPI (e.g. axpbyv,
      xpbyv).
    - Updated broken links in KernelsHowTo.md based on misnamed anchors.
    - Other minor changes.

commit c40d30a6c920bd2e5a8353a3cd07a7e2b2265758
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 9 17:55:54 2018 -0500

    Updated KernelsHowTo.md, BLISTypedAPI.md.
    
    Details:
    - Added missing (basic) information in KernelsHowTo.md for level-1f and
      level-1v kernels.
    - Updated section regarding contexts.

commit f8913c2bf91c0e0fb4e68aedf64a242a19db92a0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jul 7 20:35:13 2018 -0500

    Fixed outdated scalv() calls in penryn l1f kernels.
    
    Details:
    - Fixed stale calls to dscalv() from the dotxf and dotxaxpyf penryn
      kernels that were not updated during the basic/expert API separation
      in e88aeda.

commit e78e71d549ac17ecd52c7b33008df1cd78f1b59e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jul 7 20:18:09 2018 -0500

    Added README.md mention/link to examples/tapi.
    
    Details:
    - Added language to README.md to bring the reader's attention to the
      example code for the typed API (in addition to those for the object
      API).

commit 419ffb158573a26bfec47bac73e4394e7926a7b8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jul 7 20:14:23 2018 -0500

    Updates to README.md.
    
    Details:
    - Updated wiki links according to renamed/relocated files in 'docs'.
    - Converted links to relative paths.
    - Added link to docs/Multithreading.md.

commit 7d3e8a7e5f1ec299d009fb6c9071f0c1b089b460
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jul 7 20:01:29 2018 -0500

    Reverted docs/*.md links to relative paths.
    
    Details:
    - Within the documents in docs/*.md, reverted links to other local
      documents to relative paths.
    - Fixed some links/documents that did not yet have the '.md' suffix.
    - Testing whether we can use relative links ('docs/BLISTypedAPI.md')
      from within README.md.

commit d97c862c2b9170d774f414e63ae365488fffb4f5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jul 7 19:40:41 2018 -0500

    Updated links (URLs) in docs/*.md.
    
    Details:
    - Updated most markdown links in the documents/wikis to use absolute
      paths instead of the relative paths that were in use previously.
      A few links were not updated, except for adding a ".md" to reflect
      the documents' new names, in order to test whether relative
      linking still works.

commit 3a0c12135875e0fb04de9798664e4fae632d994e
Merge: 2c7960c8 bcacddfa
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jul 7 16:51:38 2018 -0500

    Merge branch 'dev'

commit bcacddfad75b20969660606751eea6ead6c42ca9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jul 7 16:45:29 2018 -0500

    Added 'docs' directory with wiki markdown files.
    
    Details:
    - Exported all github wikis to a new 'docs' directory.
    - Renamed 'BLISAPIQuickReference' wiki to 'BLISTypedAPI' and removed
      all cntx_t* arguments from the (now non-expert) APIs (with the
      exception of the kernel APIs).
    - Added section to BuildSystem documenting new ARG_MAX hack.

commit 3ee2bc0f7aa3b08da92331d64271bee99eaf8c1d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jul 7 16:02:16 2018 -0500

    Renamed files that distinguish basic/expert APIs.
    
    Details:
    - Renamed various files that were previously named according to a
      "with context" or "without context" convention. For example, the
      following files in frame/3 were renamed:
    
        frame/3/bli_l3_oapi_woc.c -> frame/3/bli_l3_oapi_ba.c
        frame/3/bli_l3_oapi_wc.c  -> frame/3/bli_l3_oapi_ex.c
        frame/3/bli_l3_tapi_woc.c -> frame/3/bli_l3_tapi_ba.c
        frame/3/bli_l3_tapi_wc.c  -> frame/3/bli_l3_tapi_ex.c
    
      Here, the "ba" is for "basic" and "ex" is for "expert". This new
      naming scheme will make more sense especially if/when additional
      expert parameters are added to the expert APIs (typed and object).

commit e88aedae735dfeb6fa5ac28d4527eb3ca58c6510
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 6 19:14:02 2018 -0500

    Separated expert, non-expert typed APIs.
    
    Details:
    - Split existing typed APIs into two subsets of interfaces: one for use
      with expert parameters, such as the cntx_t*, and one without. This
      separation was already in place for the object APIs, and after this
      commit the typed and object APIs will have similar expert and non-
      expert APIs. The expert functions will be suffixed with "_ex" just as
      is the case for expert interfaces in the object APIs.
    - Updated internal invocations of typed APIs (functions such as
      bli_?setm() and bli_?scalv()) throughout BLIS to reflect use of the
      new explicitly expert APIs.
    - Updated example code in examples/tapi to reflect the existence (and
      usage) of non-expert APIs.
    - Bumped the major soname version number in 'so_version'. While code
      compiled against a previous version/commit will likely still work
      (since the old typed function symbol names still exist in the new API,
      just with one less function argument) the semantics of the function
      have changed if the cntx_t* parameter the application passes in is
      non-NULL. For example, calling bli_daxpyv() with a non-NULL context
      does not behave the same way now as it did before; before, the
      context would be used in the computation, and now the context would
      be ignored since the interface for that function no longer expects a
      context argument.
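
    Illustration (not part of the commit): a minimal standalone sketch of
    the basic/expert split, assuming a simplified axpyv with unit strides.
    The names sketch_cntx_t, sketch_daxpyv(), and sketch_daxpyv_ex() are
    hypothetical; the real typed API takes additional parameters (such as
    conjugation and increments) that are left out here.

```c
#include <assert.h>
#include <stddef.h>

/* Kernel signature used by this sketch only. */
typedef void ( *sketch_axpyv_ker_t )( int n, double alpha,
                                      const double* x, double* y );

/* Hypothetical context: holds the kernel the operation should use. */
typedef struct { sketch_axpyv_ker_t axpyv; } sketch_cntx_t;

/* Reference kernel: y := y + alpha * x. */
void sketch_ref_axpyv( int n, double alpha, const double* x, double* y )
{
    for ( int i = 0; i < n; ++i ) y[ i ] += alpha * x[ i ];
}

/* Default context used whenever the caller does not supply one. */
const sketch_cntx_t* sketch_default_cntx( void )
{
    static const sketch_cntx_t cntx = { sketch_ref_axpyv };
    return &cntx;
}

/* Expert interface ("_ex" suffix): takes an optional context. */
void sketch_daxpyv_ex( int n, double alpha, const double* x, double* y,
                       const sketch_cntx_t* cntx )
{
    assert( n >= 0 );
    if ( cntx == NULL ) cntx = sketch_default_cntx();
    cntx->axpyv( n, alpha, x, y );
}

/* Basic interface: same operation, no expert parameters. */
void sketch_daxpyv( int n, double alpha, const double* x, double* y )
{
    sketch_daxpyv_ex( n, alpha, x, y, NULL );
}
```

    In this arrangement the basic function has no way to receive a context,
    which mirrors why code that previously passed a non-NULL context to the
    unsuffixed function would need to switch to the "_ex" variant.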

commit 331694e52414c0cd50048daf880a9ace9e29b94a
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Fri Jul 6 09:07:38 2018 -0600

    Fix windows build and enable x86_64 on appveyor (#230)
    
    * Upload artifacts built on appveyor (#228)
    
    * Upload artifacts
    
    * Fix install in appveyor
    
    * Remove windows.h in bli_winsys.c (#229)
    
    Looks like it is unneeded.
    
    * Implemented ARG_MAX hack in configure, Makefile.
    
    Details:
    - Added support for --enable-arg-max-hack to configure, which will
      change the behavior of make when building BLIS so that rather than
      invoke the archiver/linker with all of the object files as command
      line arguments, those object files are echoed to a temporary file
      and then the archiver/linker is fed that temporary file via the @
      notation. An example of this can be found in the GNU make docs at
      https://www.gnu.org/software/make/manual/make.html#File-Function
    - Thanks to Isuru Fernando for prompting this feature.
    
    * Enable x86_64 and arg-max-hack on appveyor
    
    * Use gas style assembly for clang on windows

commit a64a780d28c99d35f237f59212772e9beff35b3e
Merge: 89e178ce 3cb396d1
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 6 09:38:42 2018 -0500

    Merge pull request #231 from flame/travis-pr
    
    Disable SDE for PRs

commit 3cb396d1ae4ee569f862db201c6a976712fd128e
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 6 09:19:44 2018 -0500

    Disable SDE for PRs
    
    Pull requests cannot use Travis secret variables, so SDE needs to be disabled. This PR should suffice as a test.

commit 2c7960c8416ee9b67364be5f2b210fd7a0aec4b5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 5 14:38:33 2018 -0500

    Implemented ARG_MAX hack in configure, Makefile.
    
    Details:
    - Added support for --enable-arg-max-hack to configure, which will
      change the behavior of make when building BLIS so that rather than
      invoke the archiver/linker with all of the object files as command
      line arguments, those object files are echoed to a temporary file
      and then the archiver/linker is fed that temporary file via the @
      notation. An example of this can be found in the GNU make docs at
      https://www.gnu.org/software/make/manual/make.html#File-Function
    - Thanks to Isuru Fernando for prompting this feature.

commit c422a5cd191d47e6aeb9cea6de0e348f46e3e318
Merge: b6470262 89e178ce
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 5 12:33:35 2018 -0500

    Merge branch 'dev'

commit b6470262ea66c0f48a5b4d85ca4bf85c1fb2b3af
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Wed Jul 4 19:14:29 2018 -0600

    Remove windows.h in bli_winsys.c (#229)
    
    Looks like it is unneeded.

commit eac4bdf98691c5ec784af0dc11d1ad2269840661
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Wed Jul 4 18:31:01 2018 -0600

    Upload artifacts built on appveyor (#228)
    
    * Upload artifacts
    
    * Fix install in appveyor

commit 89e178ce380439dea951925e33703dc4b979e914
Merge: d868eb3e e32b2ef9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 4 17:51:16 2018 -0500

    Merge branch 'master' into dev

commit e32b2ef983ea1c3521dd3821116c0078690f125e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 4 17:49:39 2018 -0500

    Update to CREDITS file.

commit 14648e137696484e0ff04f89b16c6b4183ea42b8
Author: Isuru Fernando <isuruf@gmail.com>
Date:   Wed Jul 4 16:48:42 2018 -0600

    Native windows support using clang (#227)
    
    * Add appveyor file
    
    * Build script
    
    * Remove fPIC for now
    
    * copy as
    
    * set CC and CXX
    
    * Change the order of immintrin.h
    
    * Fix testsuite header
    
    * Move testsuite defs to .c
    
    * Fix appveyor file
    
    * Remove fPIC again and fix strerror_r missing bug
    
    * Remove appveyor script
    
    * cd to blis directory
    
    * Fix sleep implementation
    
    * Add f2c_types_win.h
    
    * Fix f2c compilation
    
    * Remove rdp and rename appveyor.yml
    
    * Remove setenv declaration in test header
    
    * set CPICFLAGS to empty
    
    * Fix another immintrin.h issue
    
    * Escape CFLAGS and LDFLAGS
    
    * Fix more ?mmintrin.h issues
    
    * Build x86_64 in appveyor
    
    * override LIBM LIBPTHREAD AR AS
    
    * override pthreads in configure
    
    * Move windows definitions to bli_winsys.h
    
    * Fix LIBPTHREAD default value
    
    * Build intel64 in appveyor for now

commit b45ea92fc6f77f2313b50dbe95922f838cbead07
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 3 18:27:29 2018 -0500

    Added typed (BLAS-like) API code examples.
    
    Details:
    - Added new example code to examples/tapi demonstrating how to use the
      BLIS typed API. These code examples directly mirror the corresponding
      example code files in examples/oapi. This setup provides a convenient
      opportunity for newcomers to BLIS to compare and contrast the typed
      and object APIs when they are used to perform the same tasks.
    - Minor cleanups to examples/oapi.

commit d868eb3e200f657a1284c4cc933e7a4d25260dce
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jun 29 12:36:04 2018 -0500

    Implemented bli_obj_scalar_cast_to().
    
    Details:
    - Implemented bli_obj_scalar_cast_to(), which will typecast the value in
      the internal scalar of an obj_t to a specified datatype.
    - Changed bli_obj_scalar_attach() so that the scalar value being attached
      is first typecast to the storage datatype of the destination object
      rather than the target datatype.
    - Reformatted function type signatures in bli_obj_scalar.c as well as
      prototypes in its corresponding header file.

commit 52d80b5f09517d80ac8a7c96983a576c1ec2080b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jun 29 12:30:44 2018 -0500

    Fixed static funcs related to target and exec dts.
    
    Details:
    - Fixed incorrect bit shifts in the following static functions:
        bli_obj_set_target_domain()
        bli_obj_set_target_prec()
        bli_obj_set_exec_domain()
        bli_obj_set_exec_prec()
    - Fixed incorrect bitmask in bli_dt_proj_to_single_prec().
    - Updated bli_obj_real_part() and bli_obj_imag_part() so that they update
      the target and exec datatypes (in addition to the storage datatypes).
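
    Illustration (not part of the commit): setters of this kind typically
    clear a field with a shifted mask and then OR in the shifted value, so
    an incorrect shift silently corrupts a neighboring field. The sketch
    below uses a made-up layout; the shift constants and the sk_ names are
    assumptions and are not the bit positions used in the obj_t info field.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: 3-bit fields packed into a 32-bit info word. */
#define SK_TARGET_DT_SHIFT 10u
#define SK_EXEC_DT_SHIFT   13u
#define SK_DT_WIDTH_MASK   0x7u

/* Overwrite one field: clear its bits at the field's position, then OR
   in the new value shifted to that same position. */
uint32_t sk_set_field( uint32_t info, unsigned shift, uint32_t value )
{
    assert( ( value & ~SK_DT_WIDTH_MASK ) == 0 );
    info &= ~( SK_DT_WIDTH_MASK << shift );
    info |= value << shift;
    return info;
}

/* Read one field back: shift it down to bit 0 and mask off the rest. */
uint32_t sk_get_field( uint32_t info, unsigned shift )
{
    return ( info >> shift ) & SK_DT_WIDTH_MASK;
}
```

    Using the same shift constant in both the clear and the set step is what
    keeps the target and exec fields independent of one another.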

commit e006f2d0eeb229c1cd05a424496a774c29bdc5d7
Merge: bd8c55fe dafca7a0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jun 27 15:54:38 2018 -0500

    Merge branch 'dev' of github.com:flame/blis into dev

commit bd8c55fe268e8e352508341ebd739ef4fc68eb92
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jun 27 15:52:37 2018 -0500

    Added dt_on_output field to auxinfo_t.
    
    Details:
    - Added a new field to the auxinfo_t struct that can be used, in theory,
      to request type conversion before the microkernel stores/accumulates
      its microtile back to memory.
    - Added the appropriate get/set static functions to bli_type_defs.h.

commit dafca7a0c2c72aaf15cb588b2bef6f246abb1905
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Jun 25 16:20:10 2018 -0500

    Fix botched memory addressing in Penryn kernel (no effect for GAS output).

commit de493b0f349efebab98ab17f063d4d3d932c24c3
Merge: 195480be a7166feb
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Jun 25 14:26:06 2018 -0500

    Merge pull request #226 from devinamatthews/dev
    
    Finish macroization of assembly ukernels.

commit 195480beb589db7d582646f556e855c611d4c3a9
Merge: 07c3d0a9 3f387ca3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 25 13:24:21 2018 -0500

    Merge branch 'master' into dev

commit 3f387ca35e42519f0d6a154814e4c8800fa2acb8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 25 12:32:03 2018 -0500

    Fixed bugs in configure's select_cc() function.
    
    Details:
    - This commit fixes several bugs in configure relating to selecting a C
      compiler. By dumb luck, two of the bugs sort of cancelled each
      other out in most use cases, which manifested as the expected behavior.
      Thanks to Mathieu Poumeyrol for bringing this issue to our attention,
      and to Devin Matthews for suggesting the more portable way of
      capturing both stdout and stderr and suggesting a return code check
      instead of testing stdout/stderr.
    - The first bug: As the values of the compiler search list are iterated
      over, only stderr is captured when querying a compiler with --version
      rather than both stdout and stderr.
    - The second bug: After each query, a conditional attempted to test
      whether the query resulted in anything being output. That conditional
      erroneously was using "-z" instead of "-n" for non-emptiness. Thus,
      most of the time, stderr was empty (because the --version info was
      being output on stdout), and since it was empty, the -z conditional
      (intended to execute only when a compiler was found to be responsive)
      executed.
    - A third bug was also fixed in the way that the merged stdout/stderr
      output was tested for non-emptiness (moving the 'cat' invocation to
      another line and testing the contents of a variable instead).
    - The three bugs above have been fixed as part of a partial rewrite of
      the select_cc() function in terms of a return code check, which
      obviated the need to save the output of stdout and stderr.
    - The fourth bug involved a misnamed variable in the right-hand side
      of a statement intended to prepend CC to search_list when CC was
      non-empty. This typically did not manifest as a bug since usually CC
      (if it was set) was set to a value that was known to work.
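
    Illustration (not part of the commit): the actual fix lives in the
    configure shell script. The standalone C sketch below only mirrors the
    idea of probing each candidate by its exit status instead of testing
    captured output for emptiness. The function names are hypothetical, a
    POSIX shell is assumed, and candidate names are pasted into a shell
    command, so only trusted strings should be passed in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Return 1 if running "<cmd> --version" exits with status 0, else 0.
   All output is discarded; only the exit status is inspected. */
int sk_cc_responds( const char* cmd )
{
    char line[ 256 ];
    assert( cmd != NULL );
    int len = snprintf( line, sizeof( line ),
                        "%s --version > /dev/null 2>&1", cmd );
    if ( len < 0 || ( size_t )len >= sizeof( line ) ) return 0;
    int status = system( line );
    return status != -1 && WIFEXITED( status ) && WEXITSTATUS( status ) == 0;
}

/* Return the first responsive candidate in the list, or NULL if none
   of them responds. */
const char* sk_select_cc( const char* const* candidates, size_t n )
{
    for ( size_t i = 0; i < n; ++i )
        if ( sk_cc_responds( candidates[ i ] ) ) return candidates[ i ];
    return NULL;
}
```

    Checking the exit status sidesteps the question of whether a compiler
    prints its version to stdout or to stderr.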

commit a7166feb1053814b7dd27f3879ae38acfc9637fc
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Jun 25 12:09:18 2018 -0500

    Finish macroization of assembly ukernels.

commit f986396c2af5de06283b9834112782afd0a8907e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jun 22 18:12:40 2018 -0500

    Added 'configure --help' text for CFLAGS, LDFLAGS.
    
    Details:
    - Added mention of the new support for preset CFLAGS, LDFLAGS to the
      bottom of the text output by './configure --help'.
    - Updated usage example to use 'haswell' instead of 'sandybridge'.

commit 884175d9ffb62e49535e6c1f7d58fb3b83e7e78f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jun 22 18:08:43 2018 -0500

    Added configure support for preset CFLAGS, LDFLAGS.
    
    Details:
    - Any preexisting values set to the CFLAGS environment variable (or the
      CFLAGS variable if given on the command line) are saved by configure
      for later inclusion (prepending, to be precise) along with the
      compiler flags automatically determined by the BLIS build system.
      (LDFLAGS is treated in a similar manner.) Thanks to Dave Love for
      requesting this feature in issue #223 and Mathieu Poumeyrol for his
      support on this and a previous related issue.
    - Comment updates to build/config.mk.in.
    - Strip whitespace from return value of various cflags functions in
      common.mk.

commit 07c3d0a95190bd23f0cd2ef220deb3384d8378d1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 21 12:35:07 2018 -0500

    Update to CREDITS file.

commit a1ebbbf158c7b34c9032ef45431bc610b6f14858
Merge: 17928b1c c81c6f23
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Jun 20 15:37:53 2018 -0500

    Merge pull request #224 from devinamatthews/asm-macros
    
    Asm macros

commit c81c6f23b9547b5d55ae68fd5a3bbd8a78290b6b
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Jun 20 15:20:44 2018 -0500

    Fix problem with inc and dec macros.

commit 5a63971c822fd452f97ba869625c8e87f6cbeebc
Merge: b4d94e54 17928b1c
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Jun 20 14:07:49 2018 -0500

    Merge remote-tracking branch 'upstream/dev' into asm-macros

commit b4d94e54d44cf30e4bb452ca5263be3473c0582d
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Jun 20 14:07:24 2018 -0500

    Convert x86 microkernels to assembly macros.

commit 17928b1c9941aa58aef1f122c793e2b14e705267
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jun 19 17:59:03 2018 -0500

    Added static funcs bli_dt_domain(), bli_dt_prec().
    
    Details:
    - Added definitions of static functions bli_dt_domain()/bli_dt_prec(),
      which extract a dom_t domain or prec_t precision value, respectively,
      from a num_t datatype.
    - Changed the return types of bli_obj_domain() and bli_obj_prec() from
      objbits_t to dom_t and prec_t. (Not sure why they were ever set to
      return objbits_t.)

commit 5f7fbb7115b1bf532c169dfd9adef84c41a95031
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jun 19 15:38:55 2018 -0500

    Static funcs for projecting dt to single/double.
    
    Details:
    - Added static functions for projecting a datatype to single precision
      or double precision, both for obj_t's storage datatypes and standalone
      datatypes.
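
    Illustration (not part of the commit): projecting a datatype to a given
    precision amounts to clearing or setting one bit while leaving the
    domain alone. The sketch assumes a 2-bit encoding in which bit 0 is the
    domain and bit 1 is the precision. That encoding and all sk_ names are
    assumptions made for this example.

```c
#include <assert.h>

/* Assumed encoding for this sketch:
   bit 0 = domain (0 real, 1 complex), bit 1 = precision (0 single,
   1 double). */
typedef enum
{
    SK_FLOAT    = 0,
    SK_SCOMPLEX = 1,
    SK_DOUBLE   = 2,
    SK_DCOMPLEX = 3
} sk_num_t;

#define SK_DOMAIN_BIT    0x1u
#define SK_PRECISION_BIT 0x2u

/* Query the domain and precision of a datatype. */
int sk_dt_is_complex( sk_num_t dt )
{
    return ( ( unsigned )dt & SK_DOMAIN_BIT ) != 0;
}

int sk_dt_is_double( sk_num_t dt )
{
    return ( ( unsigned )dt & SK_PRECISION_BIT ) != 0;
}

/* Clear the precision bit, keeping the domain unchanged. */
sk_num_t sk_dt_proj_to_single_prec( sk_num_t dt )
{
    assert( ( ( unsigned )dt & ~0x3u ) == 0 );
    return ( sk_num_t )( ( unsigned )dt & ~SK_PRECISION_BIT );
}

/* Set the precision bit, keeping the domain unchanged. */
sk_num_t sk_dt_proj_to_double_prec( sk_num_t dt )
{
    assert( ( ( unsigned )dt & ~0x3u ) == 0 );
    return ( sk_num_t )( ( unsigned )dt | SK_PRECISION_BIT );
}
```

    Under this encoding a wrong mask in either projection would flip the
    domain bit instead, turning a real type into a complex one.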

commit d4a22702c7a90273dc14f271db465c2e11e5b87e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jun 19 14:54:57 2018 -0500

    Set up haswell config for optional col-pref ukrs.
    
    Details:
    - Added two presently-disabled cpp blocks in bli_cntx_init_haswell.c to
      easily allow one to switch to a set of column-preferential gemm
      microkernels (in the haswell subconfiguration). The second column-
      preferring block sets the register blocksizes to their appropriate
      values. However, cache blocksizes are left unchanged, and therefore are
      likely suboptimal. This should be addressed later.

commit f317c2e31bfc329cb6bb4e06005e45b9c8a9d6a7
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jun 19 12:21:23 2018 -0500

    Added get/set static funcs for exec dt/dom/prec.
    
    Details:
    - Added functions to bli_obj_macro_defs.h to get and set the target
      domain and target precision bits in the obj_t, and also added the
      appropriate support in bli_type_defs.h.

commit e88a5b8da8c26caebd2b0fb73b30836fb5417c9c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 18 15:56:26 2018 -0500

    Implemented castm, castv operations.
    
    Details:
    - Implemented castm and castv operations, which behave like copym and
      copyv except that the obj_t operands can be of different datatypes.
      These new operations, however, unlike copym/copyv, do not build upon
      existing level-1v kernels.
    - Reorganized projm, projv into a 'proj' subdirectory of frame/base (to
      match the newly added frame/base/cast directory).
    - Added new macros to bli_gentfunc_macro_defs.h, _gentprot_macro_defs.h
      that insert GENTFUNC2/GENTPROT2 macros for all non-homogeneous datatype
      combinations. Previously, one had to invoke two additional macros--one
      which mixed domains only and another that included all remaining
      cases--in order to get full type combination coverage.
    - Defined a new static function, bli_set_dims_incs_2m(), to aid in the
      setting of various variables in the implementations of bli_??castm().
      This static function joins others like it in bli_param_macro_defs.h.
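
    Illustration (not part of the commit): a cast operation is a copy in
    which source and destination element types differ. The sketch below
    shows one such pairing (float to double) for strided vectors. The name
    sk_castv_s2d() and its signature are hypothetical and are not the
    interface generated by the GENTFUNC2 macros.

```c
#include <assert.h>
#include <stddef.h>

/* Copy n elements from a strided float vector into a strided double
   vector, converting each element on the way. */
void sk_castv_s2d( size_t n,
                   const float* x, ptrdiff_t incx,
                   double*      y, ptrdiff_t incy )
{
    assert( n == 0 || ( x != NULL && y != NULL ) );
    for ( size_t i = 0; i < n; ++i )
        y[ ( ptrdiff_t )i * incy ] = ( double )x[ ( ptrdiff_t )i * incx ];
}
```

    A full set of such routines needs one instance per ordered pair of
    datatypes, which is why a macro that expands over all mixed-type
    combinations is convenient.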
    - Comment update to bli_copysc.h.

commit 2000cdff59272974438e88e0e82d8e1a32710325
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 18 14:17:28 2018 -0500

    Update to CREDITS file.

commit ed2c8aed848ba2dede18df090cf2e0b6e4cc059f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 18 11:49:34 2018 -0500

    Temporarily disabled small matrix handling on zen.
    
    Details:
    - Disabled small matrix handling in config/zen/bli_family_zen.h due to
      what appears to be a bug that manifests as failures in the single and
      double precision real level-3 BLAS test drivers (visible via
      out.sblat3 and out.dblat3). Thanks to Robin Christ for reporting this
      issue.

commit ed20392c500940bfc0947795c1ff7c8c24f8e26f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jun 15 16:31:22 2018 -0500

    Added get/set static funcs for exec dt/dom/prec.
    
    Details:
    - Added functions to bli_obj_macro_defs.h to get and set the execution
      domain and execution precision bits in the obj_t.
    - Added/rearranged a few functions in bli_obj_macro_defs.h.
    - Renamed some macros in bli_type_defs.h: EXECUTION -> EXEC.

commit 22594e8e9ab55f5bc0e69d96a23e128502849999
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 14 17:35:23 2018 -0500

    Updated sandbox/ref99 according to f97a86f.
    
    Details:
    - Applied changes to ref99 sandbox analogous to those applied to
      framework code in f97a86f. This involves setting the pack schemas of
      A and B objects temporarily to communicate those desired schemas to
      the control tree creation function in blx_gemm_cntl.c. This allows us
      to (henceforth) query the schemas from the control tree rather than
      the context.

commit 1b5d0424d2c7e5eac33e02359c12917ef280949f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jun 13 18:41:32 2018 -0500

    Prototype column-preferential zen gemm ukernels.
    
    Details:
    - Added prototypes to bli_kernels_zen.h for each of the four gemm
      microkernels that prefer outputting to column storage.

commit f88c2e7a539e383297e846e6d4647058dd3db128
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jun 13 18:27:46 2018 -0500

    Defined static function bli_blksz_scale_def_max().
    
    Details:
    - Added a new static function to bli_blksz.h that scales both the default
      (regular) blocksize as well as the maximum blocksize in the blksz_t
      object. Reminder: maximum blocksizes have different meanings in
      different contexts. For register blocksizes, they refer to the packing
      register blocksizes (PACKMR or PACKNR) while for cache blocksizes, they
      refer to the maximum blocksize to use during the final iteration of a
      loop.
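
    Illustration (not part of the commit): scaling both values together
    keeps the default and maximum blocksizes in the same proportion. The
    sketch uses a single pair of integers, whereas a real blksz_t stores
    values per datatype. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical blocksize pair for a single datatype. */
typedef struct
{
    int def; /* default (regular) blocksize */
    int max; /* maximum blocksize */
} sk_blksz_t;

/* Scale both the default and maximum blocksizes by num/den, using
   integer arithmetic. */
void sk_blksz_scale_def_max( int num, int den, sk_blksz_t* b )
{
    assert( b != NULL && num > 0 && den > 0 );
    b->def = ( b->def * num ) / den;
    b->max = ( b->max * num ) / den;
}
```

    For example, scaling the pair (256, 320) by 1/2 yields (128, 160).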

commit 87db5c048e0c7f37351fda486abaf7d19fc5821c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jun 12 19:38:37 2018 -0500

    Changed usage of virtual microkernel slots in cntx.
    
    Details:
    - Changed the way virtual microkernels are handled in the context.
      Previously, there were query routines such as bli_cntx_get_l3_ukr_dt()
      which returned the native ukernel for a datatype if the method was
      equal to BLIS_NAT, or the virtual ukernel for that datatype if the
      method was some other value. Going forward, the context native and
      virtual ukernel slots will both be initialized to native ukernel
      function pointers for native execution, and for non-native execution
      the virtual ukernel pointer will be something else. This allows us
      to always query the virtual ukernel slot (from within, say, the
      macrokernel) without needing any logic in the query routine to decide
      which function pointer (native or virtual) to return. (Essentially,
      the logic has been shifted to init-time instead of compute-time.)
      This scheme will also allow generalized virtual ukernels as a way
      to insert extra logic in between the macrokernel and the native
      microkernel.
    - Initialize native contexts (in bli_cntx_ref.c) with native ukernel
      function addresses stored to the virtual ukernel slots pursuant to
      the above policy change.
    - Renamed all static functions that were native/virtual-ambiguous, such
      as bli_cntx_get_l3_ukr_dt() or bli_cntx_l3_ukr_prefers_cols_dt()
      pursuant to the above policy change. Those routines now use the
      substring "get_l3_vir_ukr" in their name instead of "get_l3_ukr". All
      of these functions were static functions defined in bli_cntx.h, and
      most uses were in level-3 front-ends and macrokernels.
    - Deprecated anti_pref bool_t in context, along with related functions
      such as bli_cntx_l3_ukr_eff_dislikes_storage_of(), now that 1m's
      panel-block execution is disabled.
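
    Illustration (not part of the commit): the policy above moves the
    native-versus-virtual decision from query time to initialization time.
    In the sketch below, scalar functions stand in for microkernels, and
    all names (sk_cntx_t, sk_macrokernel(), and so on) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a microkernel function pointer type. */
typedef double ( *sk_ukr_t )( double a, double b );

/* Hypothetical context with a native and a virtual slot. */
typedef struct
{
    sk_ukr_t nat;
    sk_ukr_t vir;
} sk_cntx_t;

/* Placeholder "native" microkernel. */
double sk_native_ukr( double a, double b ) { return a * b; }

/* Placeholder "virtual" microkernel that wraps extra logic around the
   native one. */
double sk_induced_ukr( double a, double b )
{
    return 2.0 * sk_native_ukr( a, b );
}

/* Native execution: both slots point at the native microkernel. */
void sk_cntx_init_native( sk_cntx_t* c )
{
    assert( c != NULL );
    c->nat = sk_native_ukr;
    c->vir = sk_native_ukr;
}

/* Non-native execution: the virtual slot points somewhere else. */
void sk_cntx_init_induced( sk_cntx_t* c )
{
    assert( c != NULL );
    c->nat = sk_native_ukr;
    c->vir = sk_induced_ukr;
}

/* The caller always goes through the virtual slot; no branching on the
   execution method is needed at compute time. */
double sk_macrokernel( const sk_cntx_t* c, double a, double b )
{
    assert( c != NULL && c->vir != NULL );
    return c->vir( a, b );
}
```

    With this setup, sk_macrokernel() contains no method check: the choice
    was made once when the context was initialized.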

commit dbaf440540837b03643190cd685ed889fa7fd212
Merge: 22aa44eb 2610fff0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 11 12:37:04 2018 -0500

    Merge branch 'master' into dev

commit 2610fff0b07bdb345cb2e334ef6bea0c63c8cead
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 11 12:32:54 2018 -0500

    Renamed 1m packm kernels from _1e to _1er.
    
    Details:
    - Renamed the reference packm kernels used by 1m. Previously, they used
      a _1e suffix, which was confusing since they packed to both 1e and 1r
      schemas. This was likely an artifact of the time when there were
      separate kernels for each schema before I decided to combine them into
      a single function (per datatype and panel dimension), and the 1e
      functions were the ones to inherit the 1r functionality. The kernels
      have now been renamed to use a _1er suffix.

commit 7af5283dcc3dded114852d6013d33134021b81aa
Author: sraut <Biplab.Raut@amd.com>
Date:   Mon Jun 11 15:00:22 2018 +0530

    added check condition on n-dimension for XA'=B intrinsic code to process till 128 size
    
    Change-Id: I95d020a5ca3ea21d446b8c2e379d56e1eea18530

commit 712de9b371a8727682352a2f52cd4880de905f0b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jun 9 14:36:30 2018 -0500

    Added missing semicolon in 03obj_view.c
    
    Details:
    - Thanks to Tony Skjellum for pointing out this typo due to a
      last-minute change to the source prior to committing.

commit 043d0cd37ef4a27b1901eeb89d40083cfb2a57ba
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jun 9 13:46:49 2018 -0500

    Implemented bli_acquire_mpart(), added example code.
    
    Details:
    - Implemented bli_acquire_mpart(), a general-purpose submatrix view
      function that will alias an obj_t to be a submatrix "view" of an
      existing obj_t.
    - Renumbered examples in examples/oapi and inserted a new example file,
      03obj_view.c, which shows how to use bli_acquire_mpart() to obtain
      submatrix views of existing objects, which can then be used to
      indirectly modify the parent object.
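    
    Illustrative sketch (not part of the original commit, and not BLIS
    code): the idea of a submatrix "view" that aliases its parent can be
    modeled as a second descriptor over the same buffer, differing only
    in its dimensions and starting offset. The MatrixView class and its
    submatrix() method are hypothetical names chosen for this sketch;
    the real obj_t carries more state than is shown here.

```python
class MatrixView:
    """A strided window onto a shared flat buffer (hypothetical, not obj_t)."""

    def __init__(self, buf, m, n, rs, cs, off=0):
        self.buf, self.m, self.n = buf, m, n
        self.rs, self.cs, self.off = rs, cs, off

    def _index(self, i, j):
        if not (0 <= i < self.m and 0 <= j < self.n):
            raise IndexError((i, j))
        return self.off + i * self.rs + j * self.cs

    def get(self, i, j):
        return self.buf[self._index(i, j)]

    def set(self, i, j, value):
        self.buf[self._index(i, j)] = value

    def submatrix(self, i, j, m, n):
        """Return an m x n view whose top-left element is (i, j) of this view."""
        if i < 0 or j < 0 or i + m > self.m or j + n > self.n:
            raise ValueError("submatrix exceeds parent bounds")
        # Same buffer and strides; only the offset and dimensions change.
        return MatrixView(self.buf, m, n, self.rs, self.cs,
                          self.off + i * self.rs + j * self.cs)


# A 4x5 column-major parent (row stride 1, column stride 4), initially zero.
parent = MatrixView([0.0] * 20, 4, 5, rs=1, cs=4)
child = parent.submatrix(1, 2, 2, 3)   # rows 1..2, columns 2..4 of parent
child.set(0, 0, 7.0)                   # writes through to parent element (1, 2)
```

    Because the child shares the parent's buffer, no data is copied, and
    any write through the view is visible in the parent.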

commit f1908d39767baef56077def69126d96f805ee27e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jun 8 14:22:22 2018 -0500

    Fixed broken input.operations.fast.
    
    Details:
    - Removed three input lines from input.operations.fast (labeled
      "test sequential micro-kernel") that I intended to remove in bd02c4e.
      These lines prevented 'make check' (and 'make checkblis-fast') from
      completing correctly. Note: This bug was fixed in 3df39b3, but that
      commit has not yet been merged into master, hence this redundant
      commit. Thanks to Robert van de Geijn for reporting this issue.

commit 262a62e3482c5caa947a89cabb562b5887555bd6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jun 8 12:10:54 2018 -0500

    Fixed undefined ref in steamroller/excavator configs.
    
    Details:
    - Fixed erroneous calls to bli_cntx_init_piledriver_ref() in
      bli_cntx_init_steamroller() and bli_cntx_init_excavator(), which
      should have been to their respectively-named bli_cntx_init_*()
      functions instead. Thanks to qnerd for bringing these bugs to our
      attention.

commit 22aa44ebec2c7884bdc944775a1aa7534ab53f0d
Merge: 65fae950 b65d0b84
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 7 17:42:59 2018 -0500

    Merge branch 'dev' of github.com:flame/blis into dev

commit 65fae95074d239354737355bbe6f202d4f8b2871
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 7 17:41:09 2018 -0500

    Implemented bli_setrm, _setim, _setrv, _setiv.
    
    Details:
    - Defined new wrappers to setm/setv operations in frame/base/bli_setri.c
      that will target only the real or only the imaginary parts of a
      matrix/vector object.
    - Updated bli_obj_real_part() so that the complex-specific portions of
      the function are not executed if the object is real.
    - Defined bli_obj_imag_part().
      - Caveat: If bli_obj_imag_part() is called on a real object, it does
        nothing, leaving the destination object untouched. The caller must
        take care to only call the function on complex objects.
    - Reordered some of the static functions in bli_obj_macro_defs.h related
      to aliasing.
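    
    Illustrative sketch (not part of the original commit, and not BLIS
    code): setting only the real or only the imaginary parts of a complex
    matrix. The sketch assumes complex elements are stored as interleaved
    (real, imaginary) pairs and that strides are counted in complex
    elements; set_part() is a hypothetical helper, not a BLIS function.

```python
def set_part(buf, m, n, rs, cs, value, imag=False):
    """Set only the real (or imaginary) parts of an m x n complex matrix.

    buf holds interleaved (re, im) pairs; rs and cs are the row and column
    strides in units of complex elements.
    """
    off = 1 if imag else 0          # imaginary parts sit one slot after real
    for j in range(n):
        for i in range(m):
            buf[2 * (i * rs + j * cs) + off] = value


# 2x2 column-major complex matrix whose elements all start as 1+2i.
buf = [1.0, 2.0] * 4
set_part(buf, 2, 2, 1, 2, 5.0)              # real parts become 5
set_part(buf, 2, 2, 1, 2, -1.0, imag=True)  # imaginary parts become -1
```

    Under this layout, the real-part and imaginary-part "views" differ only
    in a one-element offset into the same buffer, which is why an alias is
    enough and no copy is needed.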

commit b65d0b841b7e4357bc2cf743bbb03384a3ab0bfa
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 7 14:38:41 2018 -0500

    Fixed bug in bli_dt_proj_to_complex().
    
    Details:
    - Fixed a bug identical to the one fixed in 0a4a27e, except this time in
      the bli_obj_param_defs.h header file. It looks like the only consumers
      of this static function were in bli_l0_oapi.c, and so this may not have
      been manifesting (yet).

commit 55b6abdf7458e31df3ad01796d67c2332c776948
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 7 14:08:12 2018 -0500

    Enforce consistent datatypes in most object APIs.
    
    Details:
    - Added logic to level-1v, -1d, -1f, -1m, -2, and -3 operations' _check()
      functions to ensure that all operands are of the same datatype. There
      are some exceptions that were left out, such as the _check() function
      for the various norm operations since they have a different idea of
      datatype consistency (ie: the norm object must be the real projection
      of the primary input vector/matrix object).

commit 513138b1a1ecebd015580423c779810cae5c67f2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 7 12:24:47 2018 -0500

    Defined/implemented bli_projv().
    
    Details:
    - Added an implementation for bli_projv() to go along with the
      implementation of bli_projm() added in 0a4a27e. The only difference
      between the two is that bli_projv() may only be used on vectors,
      whereas bli_projm() is general-purpose.
    - Added a _check() function corresponding to bli_projv().

commit 5f71c1e719eb482b2a4e40daa280c4f7d05b6963
Merge: b5a641e9 3df39b37
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jun 6 19:06:14 2018 -0500

    Merge branch 'dev' of github.com:flame/blis into dev

commit b5a641e968469805906eb2c971384d12ad1beac5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jun 6 19:05:37 2018 -0500

    Added char-to-dt and dt-to-char mapping functions.
    
    Details:
    - Defined additional functions in bli_param_map.c:
        bli_param_map_char_to_blis_dt()
        bli_param_map_blis_to_char_dt()
      which will map a char to its corresponding num_t, or vice versa.

commit 0a4a27e1a4487480410bc0b1bb034bcf97583214
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jun 6 19:02:29 2018 -0500

    Defined/implemented bli_projm().
    
    Details:
    - Defined a new operation in frame/base/bli_proj.c, bli_projm(), which
      behaves like bli_copym(), except that operands a and b are allowed to
      contain data of differing domains (e.g. a is real while b is complex,
      or vice versa). The file is named bli_proj.c, rather than bli_projm.c,
      with the intention that a 'v' vector version of the function may be
      added to the same file (at some point in the future).
    - Added supporting bli_check_*() functions in bli_check.c to confirm
      consistent precisions between two datatypes/objects, as well as the
      appropriate error message in bli_error.c and a new error code in
      bli_type_defs.h.
    - Wrote a bli_projm_check() function to go along with bli_projm().
    - Defined static function bli_obj_real_part() in bli_obj_macro_defs.h,
      which will initialize an obj_t alias to the real part of the source
      object.
    - Fixed a bug in the static function bli_dt_proj_to_complex(), found
      in bli_param_macro_defs.h. Thankfully, there were no calls to the
      function to produce buggy behavior.
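    
    Illustrative sketch (not part of the original commit, and not BLIS
    code): a copy that tolerates operands of differing domains. The sketch
    assumes the natural projection semantics, namely that a real source
    copied to a complex destination gets zero imaginary parts, and that a
    complex source copied to a real destination keeps only its real parts.
    proj() is a hypothetical helper operating on flat Python lists.

```python
def proj(src, to_complex):
    """Copy src into a new list in the requested domain.

    to_complex=True  -> every element becomes complex (imaginary part 0 if
                        the source element was real).
    to_complex=False -> every element becomes real (imaginary part dropped).
    """
    if to_complex:
        return [complex(x) for x in src]
    return [complex(x).real for x in src]


real_to_complex = proj([1.0, 2.0], to_complex=True)
complex_to_real = proj([1 + 2j, 3 - 4j], to_complex=False)
```

    When source and destination share a domain, the operation reduces to
    an ordinary copy.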

commit 3df39b37a0134befa34b6b6259db98467c7bc965
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jun 6 15:35:05 2018 -0500

    Fixed recently broken input.operations.fast.
    
    Details:
    - Removed "test sequential front-end" lines from microkernel test
      entries of input.operations.fast. This change was meant for inclusion
      in bd02c4e but was missed due to slightly different wording of the
      comment (I used "sed //d" to remove the lines). This fixes the broken
      'make checkblis-fast' (and 'make check') targets.

commit 695cd520e2f5eab938f66afe9fe36201ab2700c5
Author: sraut <Biplab.Raut@amd.com>
Date:   Wed Jun 6 11:48:56 2018 +0530

    AMD Copyright information changed to 2018
    
    Change-Id: Idfd11afd5d252f8063d0158680d24bf7e2854469

commit df1dd24fd896821de60917b429f303bab7fd0d4b
Author: sraut <Biplab.Raut@amd.com>
Date:   Wed Jun 6 11:24:33 2018 +0530

    small matrix trsm intrinsics optimization code for AX=B and XA'=B
    
    Change-Id: I90123c4d9adbd314c867995cd19dc975150b448c

commit 3f48c38164b4135515b5c752c506fdccc4480be2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jun 5 16:52:35 2018 -0500

    Cosmetic fix to configure output in config.mk.
    
    Details:
    - Fixed configure so that MK_ENABLE_MEMKIND is assigned "no" when the
      option is disabled due to libmemkind not being present. This wasn't
      affecting anything since the one use of the variable (in common.mk)
      was formulated as "ifeq ($(MK_ENABLE_MEMKIND),yes)". That is, the
      variable being empty was effectively equivalent to it being set to
      "no".
    - Comment updates to build/config.mk.in, common.mk.

commit 5df201260f64aa98a365931f6d2da70144d69932
Merge: 1b9af85e 96d2774b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jun 5 16:14:19 2018 -0500

    Merge branch 'master' into dev

commit 1b9af85ec98d91bb2b27aadaa3df344d18faff35
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jun 5 16:07:13 2018 -0500

    Updated ref99 call to _cntx_set_thrloop_from_env().
    
    Details:
    - Reordered the arguments in the ref99 sandbox's call to
      bli_cntx_set_thrloop_from_env() to be consistent with the updated
      function signature from f97a86f. Thanks to Devangi Parikh for
      reporting this issue.

commit 96d2774b4cb44ff1e8b5798d7cfc83154a607624
Author: Tyler Michael Smith <tms@cs.utexas.edu>
Date:   Tue Jun 5 14:17:39 2018 +0200

    Make bli_auxinfo_next_b() return b_next, not a_next (#216)

commit d4c24ea5f644eb635046e7fe249d3e8e58b4c98a
Author: sraut <biplab.raut@amd.com>
Date:   Tue Jun 5 15:42:59 2018 +0530

    copyright message changed to 2018
    
    Change-Id: I33c1ebda41bc7f1973ff19e3b1947bdad62b4d44

commit 3f1ba4e646776699ebfaa042fe24691d9e2f55d0
Author: sraut <biplab.raut@amd.com>
Date:   Tue Jun 5 14:21:13 2018 +0530

    copyright changed to 2018
    
    Change-Id: Ie916c7cd6f95aedc3cab6eec3a703c9ddb333bc3

commit bd02c4e9f7fe07487276e61507335d48c8e05f35
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 4 13:42:17 2018 -0500

    Cleanups to testsuite, input.operations format.
    
    Details:
    - Removed the line in each operation entry in input.operations titled
      "test sequential front-end" and the corresponding support for the lines
      in the testsuite input parsing code. This line was included in some
      of the earliest versions of the testsuite, back when I intended to
      eventually have separate multithreaded APIs. Specifically, I envisioned
      that multithreaded and sequential testing could be enabled or disabled
      on an operation level. However, BLIS evolved in a different direction
      and still does not have multithreaded-specific APIs (even if it may
      someday). But even if it did have such APIs, I doubt I would
      allow the user to enable/disable them on an operation level. Thus, this
      was a zombie future parameter that was never used and never made sense
      to begin with. The one instance of the front_seq variable, used in the
      various libblis_test_<operation>() functions to guard the call to the
      operation test driver, that remains was commented out instead of
      deleted so that someday it could be easily changed via sed, if desired.
    - Various minor cleanups to the testsuite code, including consolidating
      use of DISABLE and DISABLE_ALL and reexpressing certain conditional
      expressions in the libblis_test_<operation>() functions in terms of
      boolean functions.

commit 2c6d99b99e50d70f904da298a0c59be16cc5c180
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Jun 3 18:13:36 2018 -0500

    Fixed names out of alphabetical order in CREDITS.

commit 7a207e8f2c5046f8b295a78e029ff2de765c7409
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Jun 3 18:04:27 2018 -0500

    Disabled indirect blacklisting (issue #214).
    
    Details:
    - Return early from function, pass_config_kernel_registries(), that
      implements indirect blacklisting of subconfigurations (during pass 0).
      In short, I realized that indirect blacklisting is not needed in the
      situations I envisioned, and can actually cause problems under certain
      circumstances. Thanks to Tony Skjellum for reporting the issue (#214)
      that led to this commit, and to Devin Matthews for prompting me to
      realize that indirect blacklisting was unnecessary, at least as
      originally envisioned.

commit d7fb32682057c7458c8891c0eedafc374fd9beef
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Jun 3 13:20:37 2018 -0500

    Fixed syntax artifacts from 4b36e85 in examples.
    
    Details:
    - Fixed artifacts of malformed recursive sed expressions used when
      preparing 4b36e85, in which most function-like macros were converted
      to static functions. The syntactically defective code was contained
      entirely in examples/oapi. Thanks to Tony Skjellum for reporting this
      issue.
    - Update to CREDITS file.

commit ed7dedfd4a07eefeb5a038f9899afb8053b45383
Merge: f97a86f3 469727d4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jun 2 20:29:53 2018 -0500

    Merge branch 'master' into dev

commit f97a86f322a6e3e31f33c89befc66189b0b8c64f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jun 2 20:28:20 2018 -0500

    Updated setting/querying pack schema (cntx->cntl).
    
    - Query pack schemas in level-3 bli_*_front() functions and store those
      values in the schema bitfields of the corresponding obj_t's when the
      cntx's method is not BLIS_NAT. (When method is BLIS_NAT, the default
      native schemas are stored to the obj_t's.)
    - In bli_l3_cntl_create_if(), query the schemas stored to the obj_t's in
      bli_*_front(), clear the schema bitfields, and pass the queried values
      into bli_gemm_cntl_create() and bli_trsm_cntl_create().
    - Updated APIs for bli_gemm_cntl_create() and bli_trsm_cntl_create() to
      take schemas for A and B, and use these values to initialize the
      appropriate control tree nodes. (Also cpp-disabled the panel-block cntl
      tree creation variant, bli_gemmpb_cntl_create(), as it has not been
      employed by BLIS in quite some time.)
    - Simplified querying of schema in bli_packm_init() thanks to above
      changes.
    - Updated openmp and pthreads definitions of bli_l3_thread_decorator()
      so that thread-local aliases of matrix operands are guaranteed, even
      if aliasing is disabled within the internal back-end functions (e.g.
      bli_gemm_int.c). Also added a comment to bli_thrcomm_single.c
      explaining why the extra aliasing is not needed there.
    - Change bli_gemm() and level-3 friends so that the operation's ind()
      function is called only if all matrix operands have the same datatype,
      and only if that datatype is complex. The former condition is needed
      in preparation for work related to mixed domain operands, while the
      latter helps with readability, especially for those who don't want to
      venture into frame/ind.
    - Reshuffled arguments in bli_cntx_set_thrloop_from_env() to be
      consistent with BLIS calling conventions (modified argument(s) are
      last), and updated all invocations in the level-3 _front() functions.
    - Comment updates to bli_cntx_set_thrloop_from_env().

commit 965db85d29977d228ea744581edf2b682eb8e8a8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jun 1 12:32:15 2018 -0500

    Updated macro invocations in bli_gemm_ker_var2.c.
    
    Details:
    - Updated "get next a/b micropanel" macro invocations in
      bli_gemm_ker_var2.c according to changes in 9588625.
    - Comment update in bli_cntx.c.

commit 8749fa0b48a7710f4115023e2c46bc80167bc8f9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu May 31 12:34:01 2018 -0500

    Cleanups to ref99/README.md, test/3m4m/Makefile.
    
    Details:
    - Minor edits to sandbox/ref99/README.md.
    - Removed cpp guards in sandbox/ref99/thread/blx_gemm_thread.h to be
      consistent with other headers in sandbox/ref99.
    - Additional targets and related cleanups in test/3m4m/Makefile.

commit 9588625c43c86ef1bde8140f620a30f52420e6a6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 30 15:19:53 2018 -0500

    Renamed "next micropanel" macros in _l3_thrinfo.h.
    
    Details:
    - Renamed several macros defined in bli_l3_thrinfo.h designed to compute
      the values of a_next and b_next to insert into an auxinfo_t struct in
      level-3 macrokernels. (Previously, the macros did not use a bli_
      prefix.)
    - Updated instances of above macro usage within various macrokernels.

commit e4420591225fca2f63ca74ef6a23b962fcd4bec0
Merge: 34f974d1 850a8a46
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 29 17:12:22 2018 -0500

    Merge branch 'dev' of github.com:flame/blis into dev

commit 34f974d1a83a7d29ba09f67e392d361231fdf99c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 29 17:11:52 2018 -0500

    More tweaks/updates to sandbox/ref99/README.md.

commit 850a8a46c0a569a2652d8c200e5c53b61bcf988d
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue May 29 13:51:21 2018 -0500

    Test all x86_64 configurations*... (#212)
    
    * Add custom SDE cpuid files.
    
    * Set up testing of all x86_64 architectures (except bulldozer) using SDE.
    
    * Update .travis.yml
    
    [ci skip]
    
    * Update do_testsuite.sh
    
    [ci skip]
    
    * Updated .travis.yml with my secret token.
    
    Details:
    - Replaced Devin's temporary secret token with my own, which is used by
      Travis when accessing the Intel SDE via Dropbox.
    
    * Work around CPUID dispatch in glibc/libm by patching ld.so.
    
    * Detect path of loader at runtime.
    
    * Attempt to make SDE run on Travis
    
    * Allow unpatched ld.so if we don't know how to patch it.
    
    I *think* this only happens for older glibc without the multi-arch stuff (e.g. Ubuntu 14.04 on Travis), but who knows?
    
    * Upgrade Travis to gcc-6 and binutils-2.26.
    
    * Try to get Travis to use the right assembler.
    
    * Apparently you need ld-2.26 too.
    
    * Try to also patch ld.so from Ubuntu 14.04.
    
    * Take the nuclear option.
    
    * Account for non-absolute dependencies in ldd output.
    
    * String manipulation fail.
    
    * Update patch-ld-so.py
    
    * Add Zen to SDE testing.
    
    * Removed dead variable from travis/do_testsuite.sh.
    
    Details:
    - Removed 'BLIS_ENABLE_TEST_OUTPUT=yes' from make invocations in
      travis/do_testsuite.sh. This variable is no longer present in the
      BLIS build system (if it ever was?), and therefore has no effect.

commit 42ea02a34e5c144893fe239ae55daef895d92677
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 29 12:48:14 2018 -0500

    Renamed c99 sandbox to ref99.
    
    Details:
    - Renamed sandbox/c99 to sandbox/ref99. I wanted to name the sandbox so
      that it would be thought of as a "reference" sandbox. I kept the "99"
      to differentiate it from future reference sandboxes that may be
      written in another language (such as C++).
    - Updates to sandbox/ref99/README.md.

commit 0e7205ccef50dccd4306cf427a63633396472813
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 29 12:36:13 2018 -0500

    Remove sandbox/.gitkeep now that dir is non-empty.

commit 3a4603858e3819cbd6ed7dd67d0fc0b3f89ed254
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat May 26 15:51:08 2018 -0500

    More README.md updates to sandbox/c99.
    
    Details:
    - Added a section that walks the reader through how to configure BLIS to
      use a gemm sandbox.

commit 2bad97f6bdf4642884d60fc03970549902a54d74
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat May 26 15:31:16 2018 -0500

    Updates to CREDITS, sandbox/c99/README.md.

commit 2b4a447526effa3e847a7e5c15c3758573f12318
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri May 25 18:51:23 2018 -0500

    Initial implementation of c99 "reference" sandbox.
    
    Details:
    - Added a c99 sandbox (in sandbox/c99) to serve as a starting point for
      others looking to experiment with alternative implementations of gemm
      in BLIS. Note that this sandbox implementation is a first draft and
      will be refined over time.
    - Minor updates to Makefile and common.mk to restrict what source files
      get recompiled when sandbox files are touched.
    - Added an initial draft of a README.md in sandbox/c99.

commit 469727d4f8a976d8713afb4d0b6235c322498db0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri May 25 16:17:13 2018 -0500

    Very minor comment updates.

commit 66dbe69a0f9359bf1e39b5672ee365213de2e3ee
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri May 25 15:45:53 2018 -0500

    Converted macros to static funcs in _packm_cntl.h.
    
    Details:
    - Converted various macros in frame/1m/packm/bli_packm_cntl.h (designed
      to access fields of a packm_params_t struct) to static functions.

commit 22deef2f5463a47e3b3c37fc313d17550f10ee06
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu May 24 14:28:55 2018 -0500

    Support alternative gemm implementation sandboxes.
    
    Details:
    - configure:
      - add support for --enable-sandbox=NAME to configure script, where NAME
        is a subdirectory of a new 'sandbox' directory that contains an
        alternative implementation of gemm. (For now, only implementations of
        gemm may be provided via a sandbox.);
      - add support for C++ compiler. C++ compilers are handled in a manner
        similar to that of C compilers, in that a default search order is
        used, and that CXX is searched for first, if the variable is set. In
        practice, the C++ compiler that is selected should correspond to the
        selected C compiler. (Example: If gcc is selected for C, g++ should
        be selected for C++.) The result of the search is output to config.mk
        via build/config.mk.in. NOTE: The use of C++ in BLIS is still
        hypothetical, but may eventually move to being experimental. This
        support was intended only for use of C++ within a gemm sandbox.
    - build/config.mk.in:
      - define SANDBOX variable containing sandbox subdirectory name.
    - build/bli_config.in:
      - define either of the BLIS_ENABLE_SANDBOX or BLIS_DISABLE_SANDBOX
        macros in bli_config.h.
    - common.mk:
      - include makefile fragments that were propagated into the specified
        sandbox subdirectory;
      - generate different CFLAGS for sandboxes, as well as a separate
        CXXFLAGS variable for sandboxes when C++ source files are compiled;
      - isolate into a single location lists of file suffixes for various
        purposes;
      - reorganize/clean up code related to identifying header files and
        paths.
    - Makefile:
      - generate object filepaths for and compile source code files found in
        sandbox sub-directory;
      - remove makefile fragments placed in sandbox sub-directory (cleanmk);
      - various other cleanups.
    - Added .cc, .cpp, and .cxx to list of suffixes of files to recognize in
      makefile fragments (via build/gen-make-frags/suffix_list).
    - Updated blis.h to conditionally #include bli_sandbox.h (via a new file,
      bli_sbox.h), which each sandbox is assumed to use for any type
      definitions and function prototypes it wishes to export out to blis.h.
    - Conditionally disable bli_gemmnat() implementation in frame/3 when
      BLIS_ENABLE_SANDBOX is defined.

commit 25e3501ed57a0db7f860c88b7199b36049aec12a
Merge: 216a4cb9 5140ee34
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu May 24 13:57:16 2018 -0500

    Merge branch 'master' into dev

commit 5140ee3424c744981a3fed3b5a748ebbfc111388
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 23 16:56:14 2018 -0500

    Updated types of bli_is_[un]aligned_to() functions.
    
    Details:
    - Changed the void* arguments of the following static functions:
        bli_is_aligned_to()
        bli_is_unaligned_to()
        bli_offset_past_alignment()
      to siz_t, and the return type of bli_offset_past_alignment() from
      guint_t to siz_t. This allows for more versatile usage of these
      functions (e.g. when aligning both pointers and leading dimension).
    - Updated all invocations of these functions, mostly in kernels/penryn
      but also in kernels/bgq, to include explicit typecasts to siz_t when
      pointer arguments are passed in.
    - Thanks to Devin Matthews for pointing out this potential bug (via issue
      #211).
    - Deleted a few trailing spaces in various penryn kernels.
    - Removed duplicate instances of the words "derived" and "THEORY" from
      various kernel license headers, likely from a malformed recursive sed
      performed long ago.
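    
    Illustrative sketch (not part of the original commit, and not BLIS
    code): once the alignment helpers take a plain unsigned integer rather
    than a pointer, the same test applies equally to an address and to a
    leading dimension expressed in bytes. The Python functions below are
    hypothetical stand-ins that assume the helpers reduce to a modulo test.

```python
def is_aligned_to(value, align):
    """True if value (an address or a byte count) is a multiple of align."""
    return value % align == 0


def is_unaligned_to(value, align):
    return value % align != 0


def offset_past_alignment(value, align):
    """Number of bytes by which value exceeds the previous aligned boundary."""
    return value % align


address = 0x1000                 # stands in for a pointer cast to an integer
ld_bytes = 12 * 8                # leading dimension of 12 doubles, in bytes
both_ok = is_aligned_to(address, 32) and is_aligned_to(ld_bytes, 32)
```

    A kernel that wants to use aligned vector loads on every column needs
    both conditions to hold, not just the one on the base address.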

commit 216a4cb9cb87fa4c93f6ceb6ae90602e5018b305
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri May 18 18:47:03 2018 -0500

    Minor update to flatten-headers.[py|sh] help text.
    
    Details:
    - Fixed a typo and removed some outdated language from the help text of
      flatten-headers.py and flatten-headers.sh.

commit 962a706a6f56ea070ac4683f0af69c7e59af8ecb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri May 18 18:19:40 2018 -0500

    Updated LICENSE file to mention HP Enterprise.
    
    Details:
    - Added HP Enterprise to the LICENSE file. Previously, only the source
      files touched by HPE contained the corresponding copyright notices.
      (This oversight was unintentional.)
    - Updated file-level copyright notices to include a comma, to match
      the formatting used for UT and AMD copyrights.

commit efa43e13effe901ad31e734ac90f027e89473bd9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri May 18 12:20:40 2018 -0500

    More updates to CREDITS and RELEASING files.

commit f94ab97af8e86baf9ee9a9cbaef8bb3712df2e11
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu May 17 17:45:31 2018 -0500

    Update to CREDITS file.

commit 4919b10c005e006a6d818eb8f865f9dbd8aa16df
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu May 17 16:38:49 2018 -0500

    Minor changes to README.md and CONTRIBUTING.md.

commit b89451187e8321b673a1cf7603c8d48028d9d4c8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu May 17 16:23:06 2018 -0500

    README.md update.
    
    Details:
    - Added "Contributing" section with relevant links.

commit af244194e7d76276a1b90fe59f9307dde0429e1d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu May 17 15:38:02 2018 -0500

    Removed explicit critical sec. from bli_memsys.c.
    
    Details:
    - Removed critical sections protecting the initialization/finalization of
      bli_memsys.c. These synchronization mechanisms are no longer needed now
      that BLIS initializes all APIs via pthread_once().

commit 10c9e8f95254d8c6436c4d3cb093fa5544b45c90
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu May 17 15:22:51 2018 -0500

    Cache hardware's arch_t id after querying once.
    
    Details:
    - Added logic to bli_arch.c that will call what was previously the body
      of bli_arch_query_id() only once and then cache the value in a static
      variable local to the file. (Previously, the arch_t associated with
      the hardware/configuration was queried every time bli_arch_query_id()
      was called, which was at least once per level-3 function call.) Thanks
      to Devin Matthews for suggesting this feature via issue #175.
    - Added -lpthread to the compile/link command line of the compiler
      invocation that compiles build/detect/config/config_detect.c, which
      prints the string identifying the detected configuration, since it
      is now needed due to new pthread_once() logic in bli_arch.c.
    - Implementation note: I chose to implement this arch_t caching feature
      via pthread_once(), using a separate pthread_once_t variable local to
      the file, rather than calling bli_init_once(). The reason is that I
      did not want to require bli_init() as a prerequisite to this function.
      bli_init() already calls several sub-components, some of which make use
      of bli_arch_query_id(), and therefore it would be easy to fall into a
      circular self-init situation (which usually causes pthreads to hang
      indefinitely).
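    
    Illustrative sketch (not part of the original commit, and not BLIS
    code): the query-once-then-cache pattern, using a lock private to the
    module so that it does not depend on any library-wide initialization.
    Python's threading.Lock stands in for pthread_once() here; the names
    arch_query_id(), _detect_arch_id(), and the "generic" return value are
    hypothetical.

```python
import threading

_lock = threading.Lock()    # private to this module, like a file-local once
_cached_id = None
query_count = 0             # instrumentation for the sketch only


def _detect_arch_id():
    """Stand-in for the expensive hardware query."""
    global query_count
    query_count += 1
    return "generic"


def arch_query_id():
    """Return the cached id, running the detection at most once."""
    global _cached_id
    if _cached_id is None:
        with _lock:
            # Re-check under the lock in case another thread won the race.
            if _cached_id is None:
                _cached_id = _detect_arch_id()
    return _cached_id


for _ in range(5):
    arch_query_id()
```

    Keeping the guard local to the query avoids the circular-initialization
    hazard described above, since the query never has to call back into
    the library's global initializer.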

commit f28a15293890ac6fbceac229fd204dbc9fec6e27
Author: Francisco Igual <figual@ucm.es>
Date:   Thu May 17 09:26:14 2018 +0000

    Fixed clobber list bug in ARMv8 ukernel

commit 2e31dd7852b4d6a9355899cf9659d4b8130461cb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 16 17:28:33 2018 -0500

    Inserted missing integer typecasting into ukernels.
    
    Details:
    - Inserted missing safeguards into most microkernels to ensure that the
      integers read by the microkernel's assembly instructions are of the
      appropriate size. In many cases, this bug was going undetected likely
      because the compiler was inserting zero padding before the integers
      in the calling function, allowing the assembly code to read 64 bits
      in a way that did not corrupt the "lower" 32 integer bits with garbage
      in the higher bits. Thanks to Francisco Igual and Devangi Parikh for
      finding this issue.
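    
    Illustrative sketch (not part of the original commit, and not BLIS
    code): why a 64-bit read of a 32-bit integer can appear to work. The
    sketch uses Python's struct module to lay out bytes explicitly in
    little-endian order; the value 0xDEADBEEF is an arbitrary stand-in for
    whatever happens to follow the integer in memory.

```python
import struct

k = 5

# A 32-bit integer followed by four bytes of zero padding: a 64-bit read
# happens to return the right value.
padded = struct.pack("<iI", k, 0)
lucky_read = struct.unpack("<Q", padded)[0]

# The same 32-bit integer followed by unrelated data: the 64-bit read now
# picks up garbage in its upper half.
dirty = struct.pack("<iI", k, 0xDEADBEEF)
bad_read = struct.unpack("<Q", dirty)[0]

# The fix: widen the value to 64 bits before it is read as 64 bits.
widened = struct.pack("<Q", k)
good_read = struct.unpack("<Q", widened)[0]
```

    The first case is the one that hid the bug: nothing guarantees the
    padding, so the read is only correct by accident.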

commit 12dfa9516428b4092554f0ce70b07571d35de222
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 16 12:46:57 2018 -0500

    Fixed a bug in determining default integer size.
    
    Details:
    - Fixed a bug that would cause configurations to inadvertently define
      their integers to be 32 bits when those environments actually call for
      64-bit integers. While either BLIS_ARCH_64 or BLIS_ARCH_32 is defined
      in bli_system.h (based on whether preprocessor macros such as __x86_64
      or __aarch64__ are defined by the environment), bli_system.h was being
      #included *after* bli_config_macro_defs.h, in which the BLIS_ARCH_64
      macro was used to choose an integer type size in the event that
      BLIS_INT_TYPE_SIZE was not already defined by configure via
      bli_config.h. And due to the structure of the cpp code in that file,
      the 32-bit integer case was being chosen. Thanks to Francisco Igual
      and Devangi Parikh for their help in isolating this bug.
    - Moved the #include of hbwmalloc.h and related preprocessor code to
      bli_kernel_macro_defs.h to facilitate the reshuffling of the #include
      for bli_system.h in blis.h.

commit f930cec0f35824c0f9ebbd218614209217d491cb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 15 17:47:08 2018 -0500

    More tweaks to CONTRIBUTING.md.

commit 173e30ff7d293ba31f3fab8ab0c0a695eda3d4fd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 15 14:48:34 2018 -0500

    Added initial draft of CONTRIBUTING.md file.
    
    Details:
    - Thanks to the Ruby on Rails project for providing a good template off
      of which to build.

commit 6e25e758b444bf725046674e1e64c6a52421749d
Author: Nico Schlömer <nico.schloemer@gmail.com>
Date:   Tue May 15 14:03:20 2018 +0200

    Debian config (#206)
    
    * add debian config
    
    * correct wording in the README

commit fcf6c6a3c87da08a7cdb92b102489b991ef7a644
Author: Alex Arslan <ararslan@comcast.net>
Date:   Mon May 14 18:41:03 2018 -0700

    Fix shared library builds on platforms other than Linux and macOS (#209)
    
    * Fix detection of systems other than Linux and macOS
    
    The way the logic is currently laid out, any platform that isn't Linux
    gets assigned the .dylib shared library extension and the macOS-specific
    compiler flags. This reverses the logic to check for macOS first, and
    have the fallback use the Linux definitions, which apply to most other
    systems as well.
    
    * Use SHLIB_EXT instead of SO_SUF
    
    The former is more standard, as jakirkham pointed out in a comment.
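    
    Illustration (not part of the commit): a minimal shell sketch of the
    reversed check described above, in which macOS is tested first and every
    other system falls back to the Linux definitions. The actual change
    lives in the makefiles; the function name shlib_ext_for_os is
    hypothetical.

```shell
# Hypothetical helper: map an OS name (as printed by 'uname -s') to a
# shared library extension. Darwin is checked first; every other system
# falls back to the Linux-style 'so'.
shlib_ext_for_os() {
    case "$1" in
        Darwin) echo "dylib" ;;
        *)      echo "so" ;;
    esac
}

# Report the extension that would be chosen on the current machine.
echo "SHLIB_EXT=$(shlib_ext_for_os "$(uname -s)")"
```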

commit 6f7f51048c48f31d691c06451d0fd2cbc453ad03
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon May 14 18:41:56 2018 -0500

    Echo cc_vendor when printing compiler version.
    
    Details:
    - Echo the ${cc_vendor} when informing the user of the compiler's version.
      Previously, the actual ${cc} (which could be a path to the executable)
      was being printed, which has already been printed by that point in the
      configure script.

commit ad67dc4e348b0a381efc057573a6b03cc7e26db0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon May 14 18:35:28 2018 -0500

    Communicate cc, cc_vendor to make via config.mk.
    
    Details:
    - Historically, the compiler selection has happened statically in the
      various make_defs.mk and would only be overridden by setting CC (either
      prior to running configure or as a configure argument). However, in
      the last couple months, configure has evolved to contain rather
      sophisticated compiler detection logic for the purposes of blacklisting
      sub-configurations. It only makes sense that configure now fully take
      over the responsibility of selecting a compiler from the GNU make side
      of the build system. Thanks to Alex Arslan for his help exposing this
      issue.
    - Substitute found_cc into CC in config.mk via configure.
    - Set a new variable, CC_VENDOR, in config.mk via substitution from
      configure, and disable the corresponding CC_VENDOR code in common.mk.
    - Disabled default compiler selection (usually gcc) in the sub-configs'
      various make_defs.mk files.

commit 20af119fc97ec6120017a7a5ba5f9aaa920c7640
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon May 14 17:44:58 2018 -0500

    Added README.md to 'config' directory.
    
    Details:
    - Added a brief README.md file to the config directory to redirect those
      who may be exploring the source tree to the ConfigurationHowTo wiki.
      (Included is a very brief explanation of configurations for those who
      don't have time to read the wiki.) Thanks to Nico Schlömer for this
      suggestion.

commit 9dbce16269c3e1f27c7a0d64372cc76aed30dfc1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon May 14 17:04:54 2018 -0500

    Search for 'cc clang gcc' on OpenBSD, FreeBSD.
    
    Details:
    - Swapped gcc and clang in the compiler search list for OpenBSD.
    - Use the same search list for FreeBSD as above.

commit 55ebf24d63128b5fd15b10160485667415a02a55
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon May 14 16:19:08 2018 -0500

    Change compiler search order on OpenBSD.
    
    Details:
    - Set a compiler search list (and order) as a function of the OS detected
      via 'uname -s'. By default, this list and order is 'gcc clang cc' for
      Linux, Darwin (OS X), and any other OS except OpenBSD. On OpenBSD,
      we use 'cc gcc clang' because OpenBSD's default installation of gcc
      (4.2.1) is too old for BLIS. Thanks to Alex Arslan for reporting this
      issue and suggesting a fix.
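    
    Illustration (not part of the commit): a minimal sketch of an
    OS-dependent compiler search, using the lists named in this commit
    message. The helper names cc_search_list_for_os and find_first_cc are
    hypothetical and are not taken from configure.

```shell
# Hypothetical helper: return the compiler search list for an OS name as
# reported by 'uname -s', using the lists given in the commit message.
cc_search_list_for_os() {
    case "$1" in
        OpenBSD) echo "cc gcc clang" ;;
        *)       echo "gcc clang cc" ;;
    esac
}

# Hypothetical helper: print the first candidate that is found in PATH.
# Returns non-zero (and prints nothing) if none of them is found.
find_first_cc() {
    for candidate in "$@"; do
        if command -v "${candidate}" > /dev/null 2>&1; then
            echo "${candidate}"
            return 0
        fi
    done
    return 1
}

# Word splitting of the search list is intentional here.
find_first_cc $(cc_search_list_for_os "$(uname -s)") || echo "no compiler found"
```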

commit 4fb353bd90e6642c8aeffd1b1e6329f54eee4bb4
Merge: 4b36e85b 8a2857b5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun May 13 17:50:51 2018 -0500

    Merge branch 'master' into dev

commit 8a2857b5e3c633b18c24f2275110437a702a71d0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri May 11 18:42:05 2018 -0500

    Fixed README.md typo; mention 'make check'.

commit 543935c02f9335142d2e485a15f37dbaebe012ed
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri May 11 18:35:32 2018 -0500

    Updated README.md with Ubuntu packages link.
    
    Details:
    - Created a separate section of README.md for external packages, with
      one bullet each for Dave Love's rpms and Nico Schlömer's Ubuntu apt
      packages. Thanks to Dave and Nico for their contributions.

commit af1d8470b56d3b2a1c8513d366d788dddcb84baa
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri May 11 17:49:58 2018 -0500

    Better handling of shared libraries on OS X.
    
    Details:
    - Use the .dylib shared library suffix on OS X (instead of .so in Linux).
    - Link with the -dynamiclib and -install_name options on OS X (instead of
      -shared and -soname in Linux).
    - Determine operating system (e.g. Linux, Darwin) during configure and
      substitute into config.mk.in rather than run 'uname -s' during make.
    - Echo operating system during configure.

commit 4b72a462d7467cf815422aafac7b05037d2e3b13
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu May 10 18:35:38 2018 -0500

    Enable building shared library by default.
    
    Details:
    - Tweaked configure so that the shared library is generated by default.
    - Updated --help text and configure's feedback messages reporting the
      status of the static/shared builds.
    - Changed the order of build product installation so that headers are
      installed last, after libraries and symlinks.

commit b699bb1ff03c6e9baaa054805b4939983ae7145b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu May 10 15:54:17 2018 -0500

    Adopt Linux-like .so versioning at install-time.
    
    Details:
    - Changed the naming conventions used for installed libraries and
      symlinks to more closely mirror patterns used by typical GNU/Linux
      libraries. Whereas previously static and shared libraries were
      installed and symlinked as follows:
    
        (library) libblis-0.3.2-15-haswell.a
        (library) libblis-0.3.2-15-haswell.so
        (symlink) libblis.a -> libblis-0.3.2-15-haswell.a
        (symlink) libblis.so -> libblis-0.3.2-15-haswell.so
    
      we now use the following naming conventions:
    
        (library) libblis.a
        (symlink) libblis.so -> libblis.so.0.1.2
        (symlink) libblis.so.0 -> libblis.so.0.1.2
        (library) libblis.so.0.1.2
    
      where 0.1.2 indicates shared library major, minor, and build versions
      of 0, 1, and 2, respectively. The conventional version string can
      still be queried by linking to the library in question and then calling
      bli_info_get_version_str(). (The testsuite binary does this
      automatically at startup.)
    - Added logic to common.mk to set the soname field in the shared library
      via the -soname linker flag.
    - Added a 'so_version' file to the top-level directory containing two
      lines. The first line specifies the .so major version number, and the
      second line specifies the minor and build version numbers joined with
      a '.'. This file is read by configure and those values substituted
      into build/config.mk.in to define SO_MAJOR, SO_MINORB, and SO_MMB
      variables.
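    
    Illustration (not part of the commit): a minimal sketch of reading a
    two-line 'so_version' file and deriving the three values. It uses a
    temporary stand-in file holding the 0.1.2 example from above, and it
    assumes that SO_MMB is simply the major version joined to the second
    line with a '.'.

```shell
# Hypothetical stand-in for the top-level 'so_version' file, holding the
# example values from the commit message.
so_version_file=$(mktemp)
printf '0\n1.2\n' > "${so_version_file}"

# Line 1 is the major version; line 2 is "minor.build".
so_major=$(sed -n '1p' "${so_version_file}")
so_minorb=$(sed -n '2p' "${so_version_file}")

# Assumption: the full version is the two lines joined with a '.'.
so_mmb="${so_major}.${so_minorb}"
rm -f "${so_version_file}"

echo "(library) libblis.so.${so_mmb}"
echo "(symlink) libblis.so.${so_major} -> libblis.so.${so_mmb}"
echo "(symlink) libblis.so -> libblis.so.${so_mmb}"
```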

commit fc2d9ec6bf46f6e5b19d196208415ce433e95b10
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 9 15:19:28 2018 -0500

    Tweaks to top-level clean and distclean targets.
    
    Details:
    - Moved the removal of bli_config.h from cleanh to distclean.
    - Removed cleantest as a dependency of clean.

commit bf0350305971e3991861b5117a13fda31ff97b6d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 8 16:49:22 2018 -0500

    Renamed (shortened) a few build system variables.
    
    Details:
    - Renamed the following variables in config.mk (via build/config.mk.in):
        BLIS_ENABLE_VERBOSE_MAKE_OUTPUT -> ENABLE_VERBOSE
        BLIS_ENABLE_STATIC_BUILD        -> MK_ENABLE_STATIC
        BLIS_ENABLE_SHARED_BUILD        -> MK_ENABLE_SHARED
        BLIS_ENABLE_BLAS2BLIS           -> MK_ENABLE_BLAS
        BLIS_ENABLE_CBLAS               -> MK_ENABLE_CBLAS
        BLIS_ENABLE_MEMKIND             -> MK_ENABLE_MEMKIND
      and also renamed all uses of these variables in makefiles and makefile
      fragments. Notice that we use the "MK_" prefix so that those variables
      can be easily differentiated (such as via grep) from their "BLIS_" C
      preprocessor macro counterparts.
    - Other whitespace changes to build/config.mk.in.
    - Renamed the following C preprocessor macros in bli_config.h (via
      build/bli_config.h.in):
        BLIS_ENABLE_BLAS2BLIS        -> BLIS_ENABLE_BLAS
        BLIS_DISABLE_BLAS2BLIS       -> BLIS_DISABLE_BLAS
        BLIS_BLAS2BLIS_INT_TYPE_SIZE -> BLIS_BLAS_INT_TYPE_SIZE
      and also renamed all relevant uses of these macros in BLIS source
      files.
    - Renamed "blas2blis" variable occurrences in configure to "blas", as
      was done in build/config.mk.in and build/bli_config.h.in.
    - Renamed the following functions in frame/base/bli_info.c:
        bli_info_get_enable_blas2blis() -> bli_info_get_enable_blas()
        bli_info_get_blas2blis_int_type_size()
                                        -> bli_info_get_blas_int_type_size()
    - Remove bli_config.h during 'make cleanh' target of top-level Makefile.

commit 4b36e85be9b516b4089b24768f881dd976668997
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 8 14:26:30 2018 -0500

    Converted function-like macros to static functions.
    
    Details:
    - Converted most C preprocessor macros in bli_param_macro_defs.h and
      bli_obj_macro_defs.h to static functions.
    - Reshuffled some functions/macros to bli_misc_macro_defs.h and also
      between bli_param_macro_defs.h and bli_obj_macro_defs.h.
    - Changed obj_t-initializing macros in bli_type_defs.h to static
      functions.
    - Removed some old references to BLIS_TWO and BLIS_MINUS_TWO from
      bli_constants.h.
    - Whitespace changes in select files (four spaces to single tab).

commit 7e5648ca150757b874f6823da832f3798c40b9f9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon May 7 18:59:19 2018 -0500

    Add configure support for --libdir, --includedir.
    
    Details:
    - Added support for two new configure options: --libdir and --includedir.
      They specify the precise install directories for libraries and header
      files, respectively, and override any location implied by the --prefix
      option (including the default install prefix, if --prefix was not
      given). Thanks to Nico Schlömer for suggesting this via issue #195.
    - Removed the INSTALL_PREFIX definition/anchor from build/config.mk.in
      and replaced it with corresponding definitions/anchors for libdir and
      includedir.
    - Updated top-level Makefile to use the new variables, INSTALL_LIBDIR
      and INSTALL_INCDIR, instead of INSTALL_PREFIX (which is now no longer
      needed by make).
    - Set default sane values for INSTALL_LIBDIR and INSTALL_INCDIR in
      common.mk when configure has not been run, as is already done for
      DIST_PATH. This is to safeguard against statements in the top-level
      Makefile that use 'find' to locate old libraries and headers for the
      uninstall targets, which run regardless of make target. Without setting
      INSTALL_LIBDIR and INSTALL_INCDIR, those variables are empty and the
      'find' ends up looking at '/', which is obviously not what we want.
      (Also enclosed those definitions in an IS_CONFIGURED guard so that they
      won't get evaluated unless configure has been run.)
    - Rearranged "ifeq ($(IS_CONFIGURED),yes)" conditionals in Makefile to
      reduce occurrences and separated "local" and top-level components of
      cleanblastest and cleanblistest targets to improve readability.
    - Adjusted out-of-tree builds so that they are no longer oblivious to
      the .git directories, if present, and thus now properly augment version
      strings with the appropriate patch number.
    - Include missing version string in 'configure --help' output.
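    
    Illustration (not part of the commit): a minimal sketch of how
    --libdir and --includedir could override the locations implied by
    --prefix. The helper name resolve_install_dirs is hypothetical, and the
    implied locations <prefix>/lib and <prefix>/include are an assumption
    based on common convention rather than something stated above.

```shell
# Hypothetical helper. Arguments: prefix, libdir, includedir. An empty
# libdir or includedir means that option was not given.
resolve_install_dirs() {
    prefix="$1"
    libdir="$2"
    incdir="$3"
    # Assumption: the implied locations are <prefix>/lib and
    # <prefix>/include. An explicit value always takes precedence.
    echo "INSTALL_LIBDIR=${libdir:-${prefix}/lib}"
    echo "INSTALL_INCDIR=${incdir:-${prefix}/include}"
}

# Only --prefix given: both directories are derived from it.
resolve_install_dirs "/usr/local" "" ""

# --libdir given as well: it overrides the prefix for libraries only.
resolve_install_dirs "/usr/local" "/usr/lib64" ""
```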

commit b09e4e8852a6c42895910e3bcb9041124dc8bf9f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon May 7 14:37:50 2018 -0500

    Allow 'make clean' and friends without configuring.
    
    Details:
    - Modified top-level Makefile so that a user can run 'make distclean',
      'make clean', or any of the other clean-related targets prior to
      running configure (or after a previous 'make distclean'). Thanks to
      Nico Schlömer for suggesting this via issue #197.
    - Made the cleanblastest and cleanblistest more comprehensive in that
      they now clean out build products that would have resulted from local
      compilation (i.e., builds performed within the 'blastest' or 'testsuite'
      directories).
    - Added "cc" to list of expected compiler "vendors" since the CC variable
      seems to automatically be set to "cc" on Ubuntu 16.04 (which is just an
      alias to gcc).
    - Comment update to build/config.mk.in.

commit 35c5a1449c3efe0b2ec43cdefcfdf00e71828149
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon May 7 12:04:57 2018 -0500

    No longer update version file during configure.
    
    Details:
    - Recycled the core functionality of build/update-version-file.sh into a
      function in configure, disabling the updating of the 'version' file in
      the process. Instead of writing the patched version string back to the
      version file and then reading it again from within configure, the
      patched version string is now saved directly to a variable in the main()
      function in configure. This will prevent developers from accidentally
      committing configure-induced changes to the version file in between
      releases.

commit 8adb2f919b62da4a2885ae04a10925e0e6a2e304
Author: Mathieu Poumeyrol <kali@users.noreply.github.com>
Date:   Sun May 6 19:58:16 2018 +0200

    Some cross compilations fixes (#198)
    
    * cross-compilation fixes
    * add doc ranlib variable
    * icc support -dumpversion, posix compatible test, plus one stupid mistake
    * retab
    * revert version as requested

commit 89acd9ebe516eeb97006dba344354bfc98826645
Merge: 4cff432d 0557eba7
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 2 12:53:35 2018 -0500

    Merge branch 'amd'

commit 4cff432d707891ada705b039a7e043558bbf3c51
Author: Nisanth M P <31736542+nisanthmpamd@users.noreply.github.com>
Date:   Wed May 2 23:20:42 2018 +0530

    AMD specific optimizations for target 'zen' (#194)
    
    Re-enabled AMD-specific optimizations for zen.
    
    Details:
    - Re-enabled Zen-specific cache blocksizes for 'zen' sub-configuration.
    - Re-enabled small matrix gemm optimization for 'zen'.
    - These were both temporarily disabled during a previous merge simply
      due to lack of Zen hardware for testing.

commit 8eda5fe7f678b413cb274bd84716995a7d0b87a9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 2 12:20:37 2018 -0500

    Typo fix in README.md.

commit 0557eba78f5fcf28f0f039f28da79498ffde848c
Author: Nisanth M P <nisanth.padinharepatt@amd.com>
Date:   Mon Mar 19 12:49:26 2018 +0530

    Re-enabling the small matrix gemm optimization for target zen
    
    Change-Id: I13872784586984634d728cd99a00f71c3f904395

commit df78ceb3d6f33a27fe69017854405edaea7c40e5
Author: Nisanth M P <nisanth.padinharepatt@amd.com>
Date:   Mon Mar 19 11:34:32 2018 +0530

    Re-enabling Zen optimized cache block sizes for config target zen
    
    Change-Id: I8191421b876755b31590323c66156d4a814575f1

commit 5e515f9a76f4aaf43dc21315a34d797726ca8069
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 1 13:44:10 2018 -0500

    Tweaked new language in README.md.

commit 1ddd9e316ad5024af8b606dfcebd1e7d587a130f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 1 13:36:28 2018 -0500

    Added link to Dave Love's Fedora Copr page.
    
    Details:
    - Added a blurb to README.md advertising Dave Love's Copr homepage,
      which contains rpm packages for RHEL/Fedora-like distributions.

commit 078a852f738c66c6468bd5e64b06467edc9057fd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 30 16:15:26 2018 -0500

    Minor tweaks to top-level 'make clean' target.
    
    Details:
    - Execute 'cleanh' target as part of 'clean'
    - Remove cblas.h file from 'include/<configname>/' as part of 'cleanh'
      target.
    - Updated the echoed (non-verbose) text for uniformity.

commit 75d0d1057dda69c655bd1cd8f791cb39b54d99b8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 30 14:57:33 2018 -0500

    Renamed various datatype-related macros/functions.
    
    Details:
    - Renamed the following macros in bli_obj_macro_defs.h and
      bli_param_macro_defs.h:
      - bli_obj_datatype()                 -> bli_obj_dt()
      - bli_obj_target_datatype()          -> bli_obj_target_dt()
      - bli_obj_execution_datatype()       -> bli_obj_exec_dt()
      - bli_obj_set_datatype()             -> bli_obj_set_dt()
      - bli_obj_set_target_datatype()      -> bli_obj_set_target_dt()
      - bli_obj_set_execution_datatype()   -> bli_obj_set_exec_dt()
      - bli_obj_datatype_proj_to_real()    -> bli_obj_dt_proj_to_real()
      - bli_obj_datatype_proj_to_complex() -> bli_obj_dt_proj_to_complex()
      - bli_datatype_proj_to_real()        -> bli_dt_proj_to_real()
      - bli_datatype_proj_to_complex()     -> bli_dt_proj_to_complex()
    - Renamed the following functions in bli_obj.c:
      - bli_datatype_size()                -> bli_dt_size()
      - bli_datatype_string()              -> bli_dt_string()
      - bli_datatype_union()               -> bli_dt_union()
    - Removed a pair of old level-1f penryn intrinsics kernels that were no
      longer in use.

commit 01c4173238baf08e7f6700a3f91a2ea58cca50c1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Apr 28 14:07:34 2018 -0500

    CHANGELOG update (0.3.2)

commit 2fb440876690bdcec0c11a30e2b33ad100bab529 (tag: 0.3.2)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Apr 28 14:07:31 2018 -0500

    Version file update (0.3.2)

commit cdf041ddadd8725e578e2f59f37ae341f26655af
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Apr 28 14:05:00 2018 -0500

    Use config.mk instead of common.mk in bump-version.sh.
    
    Details:
    - Fixed inadvertent targeting of common.mk when testing whether configure
      had already been run, rather than config.mk.

commit 6ded8f9f0364b3c07255e2532ada3eeb2ed2a715
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Apr 28 14:01:29 2018 -0500

    Account for recent 'make distclean' in bump-version.sh.
    
    Details:
    - Added logic to build/bump-version.sh that will run './configure auto'
      if 'common.mk' is not present (usually because 'make distclean' was run
      recently).

commit 7c16fdce433f5dea0e83d5047553c955d8e46fd2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Apr 28 13:50:55 2018 -0500

    Fixed typo in RELEASING file.

commit 5e5ca4984fcf6d72d3036c338bb9cdc64520a325
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Apr 28 13:48:01 2018 -0500

    README updates.
    
    Details:
    - Updates to the README files in the top-level directory as well as
      the 'examples/oapi' directory.

commit 627b045e301defea6770dc5b64e1110cbec25153
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 27 18:11:19 2018 -0500

    Added an example of using transposition with gemm.
    
    Details:
    - Added an example to examples/oapi/8level3.c to show how to indicate
      transposition when performing a gemm operation.

commit 13a0eadc69d72933e322901f5b44944834e3c787
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 27 18:00:07 2018 -0500

    Added more transposition/conjugation examples.
    
    Details:
    - Added code to examples/oapi/5level1m.c that demonstrates transposing
      (and conjugate-transposing) unstructured matrices.
    - Comment updates to 6level1m_diag.c to maintain consistency with new
      examples in 5level1m.c.

commit 5606cd8881e75264a96af45dc8ea1905bab054f5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 27 17:13:10 2018 -0500

    Added utility module to examples/oapi.
    
    Details:
    - Added a new code example file to examples/oapi demonstrating how to use
      various utility operations.
    - Comment updates to other example files.
    - README updates.

commit ff26c94c6486374c709f93c6965ea18903bd6a18
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 27 12:31:34 2018 -0500

    Added missing gcc version constraint for knl.
    
    Details:
    - Previously forgot to add explicit enforcement of a minimum gcc version
      in configure script when 'knl' sub-configuration is requested.
    - Comment updates to configure.

commit 4d97574e477b3e55ddbb6044b0542a92cd9bab30
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 24 18:48:09 2018 -0500

    Added object API example code.
    
    Details:
    - Added an 'examples' directory at the top level.
    - Added an 'oapi' subdirectory in 'examples' that contains a tutorial-like
      sequence of example code demonstrating the core functionality of BLIS's
      object-based API, along with a Makefile and README. Thanks to Victor
      Eijkhout for being the first to suggest including such code in BLIS.

commit d6ab25a3232aa52b9b855088fb4b0b46ff2c00c8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 24 18:43:03 2018 -0500

    Add setijm, getijm operations.
    
    Details:
    - Added bli_setgetijm.c, which defines bli_setijm(), bli_getijm(), and
      related functions that can be used to read and write individual
      elements of an obj_t.
    - Defined a new function, bli_obj_create_conf_to(), in bli_obj.c that will
      create a new object with dimensions conformal to an existing object.
      Transposition and conjugation states on the existing object are ignored,
      as are structure and uplo fields.
    - Defined a new function, bli_datatype_string(), in bli_obj.c that returns
      a char* to a string representation of the name of each num_t datatype.
      For example, BLIS_DOUBLE is "double" and BLIS_DCOMPLEX is "dcomplex".
      BLIS_INT is included (as "int"), but BLIS_CONSTANT is not, and thus is
      not a valid input argument to bli_datatype_string().
    - Added calls to bli_init_once() to various functions in bli_obj.c, the
      most important of which was bli_obj_create_without_buffer().
    - Removed unintended/extra newline from the end of printv output.
    - Whitespace changes to
      - frame/base/bli_machval.c
      - frame/base/bli_machval.h
      - frame/0/copysc/bli_copysc.c
    - Trivial changes to README.md and common.mk.

commit a731a428f7fc02fd6ab4f953ead828c1d06fb5a1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 17 16:44:55 2018 -0500

    Another README.md update.

commit c734ee928a824b27d280a9a67b1b4bc8423d5795
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 17 16:40:05 2018 -0500

    README.md update.

commit 03ecad372d8eb603ee905a7b944d0544a813460a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 17 14:16:59 2018 -0500

    Added RELEASING file.
    
    Details:
    - Added a file named 'RELEASING' that contains basic notes on how to
      create a new version/release of BLIS. This is mostly just a reminder
      to myself, but also may become useful if/when others take over
      development and administration of the project.

commit 24b3c3149ce66546b9a1afc2cc794a637a86aa60
Merge: 60366a3f 817b67c0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 16 18:49:38 2018 -0500

    Merge branch 'dev' of github.com:flame/blis into dev

commit 60366a3faba4e60cee85c3b87a3f69625f4b9026
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 16 18:46:21 2018 -0500

    Updates to knl kernels and related code.
    
    Details:
    - Imported the 24x16 knl sgemm microkernel (and its corresponding spackm
      kernel) from TBLIS and enabled its use in the knl sub-config. Also
      added sgemm microkernel prototype to bli_kernels_knl.h.
    - Updated dgemm and dpackm microkernels from TBLIS, which included an
      important change regarding the offsets array (changed from extern
      declaration to static declaration/definition).
    - Activated use of level-1v and -1f zen kernels in skx and knl
      sub-configs.
    - Removed some old macros no longer needed in bli_family_skx.h now that
      libmemkind support exists in configure.
    - Moved bli_avx512_macros.h to frame/include and adjusted #includes in
      skx and knl kernels accordingly.
    - Moved unused kernels in kernels/knl/3 to kernels/knl/3/other
      directory.
    - Fixed a minor bug in the 'make' output per compile when verboseness
      is not turned on. The rule-generating function 'make-kernel-rule' was
      previously passing in the name of the config, rather than the name of
      the kernel set returned by get-config-for-kset, which could give
      misleading information to the user when the kconfig_map mapped a
      kernel set to a sub-configuration that did not share the same name.
      (This didn't affect the CFLAGS that were actually used.)
    - Updated test/3m4m/Makefile, removing acml targets and renaming the
      remaining targets.

commit 817b67c01752e0ca8fe230bb8ad23afc7bd0f64e
Merge: 67c9c2f8 2b7108a8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 16 14:06:26 2018 -0500

    Merge branch 'dev' of github.com:flame/blis into dev

commit 67c9c2f86d5ef2accc439b21581d73d82754a2e3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 16 14:03:12 2018 -0500

    Retired haswell gemm microkernels.
    
    Details:
    - Moved microkernels in kernels/haswell/3 to kernels/haswell/3/old. These
      microkernels were no longer being used and only sowed confusion to
      anyone inspecting the repository without being fully cognizant of the
      build system and how it works (and sometimes even to those who wrote
      the build system). Note that the haswell configuration currently
      employs the zen microkernels.

commit 2b7108a8ef8ce958b3acad028ff07c85ff97fd63
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 16 12:35:53 2018 -0500

    Minor updates to test driver makefiles.
    
    Details:
    - Cleaned up and homogenized the various test driver Makefiles in
      testsuite and test directories.
    - Very minor updates to test driver code.

commit 9f56df95570a24587b910b169f342bd356ccbfb6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Apr 11 14:51:36 2018 -0500

    Trivial tweaks to configure blacklisting output.
    
    Details:
    - Updated output of information vis-a-vis configuration blacklisting.

commit f56481efebd9a7785c0618f3a12c0bec36f46333
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 10 19:02:21 2018 -0500

    Cleaned up assembler version query on OS X.
    
    Details:
    - Switched from querying the version of 'objdump' to that of 'as' (i.e.
      the assembler).
    - Fixed the outputting of the version of 'as' on OS X, which required
      this beauty:
        ...=$(as -v /dev/null -o /dev/null 2>&1)
    - Only add sub-configs to blacklist if the sub-config hasn't already
      been added.

commit 088c474e629535affbe111f141f895af50d109be
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 10 18:09:56 2018 -0500

    Added support for blacklisting via the assembler.
    
    Details:
    - Added logic to configure that attempts to assemble various small files
      containing select instructions designed to reveal whether binutils
      (specifically, the assembler) supports emitting those instruction sets.
      This information provides additional opportunities to blacklist sub-
      configurations that are unsupported by the environment. Thanks to Devin
      Matthews for pointing me towards a similar solution in TBLIS as an
      example.
    - Various other cleanups in configure.
    - Reorganized the detection code in the 'build' directory, bringing the
      "auto-detect" configuration detection, libmemkind detection, and new
      instruction set detection codes into a single new subdirectory named
      'detect'.
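    
    Illustration (not part of the commit): a minimal sketch of probing the
    assembler by trying to assemble a tiny source file. The helper name
    assembler_accepts, the probe instruction, and the mapping from that
    instruction to the 'haswell' sub-configuration are all assumptions
    chosen for illustration; the commit does not list the instructions or
    files that configure actually uses.

```shell
# Hypothetical helper: succeed only if the assembler ('as') accepts the
# given one-line source text. Temporary files hold the input and output.
assembler_accepts() {
    src=$(mktemp)
    obj=$(mktemp)
    printf '%s\n' "$1" > "${src}"
    if as "${src}" -o "${obj}" > /dev/null 2>&1; then
        status=0
    else
        status=1
    fi
    rm -f "${src}" "${obj}"
    return "${status}"
}

# Illustrative probe only: the instruction and the sub-configuration it
# guards are assumptions, not taken from the commit.
blacklist=""
if ! assembler_accepts 'vfmadd231pd %ymm0, %ymm1, %ymm2'; then
    blacklist="${blacklist} haswell"
fi
echo "blacklisted sub-configs:${blacklist:- (none)}"
```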

commit 78a24e7dada52a3582f8488795bd1a44993989d9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 9 17:02:13 2018 -0500

    Updated bli_avx512_macros.h in knl and skx configs.
    
    Details:
    - Downloaded updated version of bli_avx512_macros.h from TBLIS [1] in
      attempt to address issue #192.
      [1] https://github.com/devinamatthews/tblis/

commit 388f64d6ade14caa4a6c286845ad2d565378b2bb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 9 15:33:10 2018 -0500

    Fixed failure to honor CC= argument to configure.
    
    Details:
    - Fixed a failure to observe the value of CC when selecting the compiler
      in configure. Thanks to Devangi Parikh for reporting this bug.
    - The semantics now also work for the CC environment variable. That is,
      if CC is set prior to running configure, that value is used, but will
      be overridden by specifying the CC= argument to configure. If the CC
      environment variable is not set, the CC= value is used. If neither the
      environment variable nor CC= is specified, then the choice is made
      internally to configure: first attempting to find gcc, then clang, and
      then cc.
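    
    Illustration (not part of the commit): a minimal sketch of the
    precedence described above, in which the CC= argument beats the CC
    environment variable, which in turn beats the internal search. The
    helper name select_cc is hypothetical.

```shell
# Hypothetical helper. Arguments: the value of the CC environment
# variable and the value of the CC= configure argument. Either may be
# empty, meaning "not given".
select_cc() {
    env_cc="$1"
    arg_cc="$2"
    if [ -n "${arg_cc}" ]; then
        # The CC= argument overrides everything else.
        echo "${arg_cc}"
    elif [ -n "${env_cc}" ]; then
        # Otherwise the environment variable is used.
        echo "${env_cc}"
    else
        # Neither was given: search for gcc, then clang, then cc.
        for candidate in gcc clang cc; do
            if command -v "${candidate}" > /dev/null 2>&1; then
                echo "${candidate}"
                return 0
            fi
        done
        return 1
    fi
}

# Use the current environment's CC (if any) and no CC= argument.
select_cc "${CC:-}" "" || echo "no compiler found"
```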

commit 45fbe66b3e2ab92f0b4fdf437d57c5d06603803d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 9 14:01:08 2018 -0500

    Fixed libmemkind dependency for x86_64.
    
    Details:
    - Removed some old conditional code in config/knl/make_defs.mk that
      added -lmemkind to LDFLAGS if DEBUG_TYPE was not 'sde' and inserted
      code into common.mk that affirmatively filters out -lmemkind from
      LDFLAGS if DEBUG_TYPE is 'sde'. (Thanks to Dave Love for reporting
      this issue.) Other minor cleanups to neighboring code in common.mk.
    - Updated CRVECFLAGS in knl/make_defs.mk to be based on -march=knl,
      and then AVX-512 functionality is manually removed via various
      -mno-avx512* flags. Also, make the setting of CRVECFLAGS conditional
      on CC_VENDOR. Similar change to skx/make_defs.mk.
    - Comment/whitespace updates.

commit ca982148b3b419db063cad2fa74376ec383a5c80
Author: dnp <devangiparikh@gmail.com>
Date:   Sun Apr 8 21:27:10 2018 -0500

    Fixed bug in SKX sgemm microkernel. Modified SKX dgemm microkernel to be consistent with the sgemm microkernel

commit bd0276752ccdd56ff897b1a5ae022f2ffe6e0b38
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 6 18:51:43 2018 -0500

    Track separate ref kernel flags for each sub-config.
    
    Details:
    - Renamed CVECFLAGS variables in sub-configurations' make_defs.mk files
      to CKVECFLAGS.
    - Added default definitions of two new make variables to most sub-
      configurations' make_defs.mk files--CROPTFLAGS and CRVECFLAGS--
      which correspond to reference kernel analogues of the CKOPTFLAGS
      and CKVECFLAGS, which track optimization and vectorization flags for
      optimized kernels. Currently, two sub-configurations (knl and skx)
      explicitly set CRVECFLAGS to non-default values (using AVX2 instead of
      AVX-512 for reference kernels). Thanks to Jeff Hammond, whose feedback
      prompted me to make this change (issue #187).
    - Changed common.mk so that the get-refkern-cflags-for function returns
      the flags associated with the given sub-configuration's CROPTFLAGS
      and CRVECFLAGS (instead of CKOPTFLAGS and CKVECFLAGS).

commit b9aebce19480448817373e2df2b36bd090eae41a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 6 18:37:33 2018 -0500

    De-verbosify makefile fragment generation.
    
    Details:
    - Changed from -v1 to -v0 when calling gen-make-frag.sh from configure.
      The directory-by-directory recursive output didn't add much value to
      the user, so now we just echo a line for each top-level directory into
      which we will recurse (e.g. 'config', 'ref_kernels', 'frame', etc.).
      This also helps keep more interesting information (from earlier in the
      execution of configure) from scrolling out of the terminal window.

commit b549b91f26948991e13364f1f26a878da0f43aa0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 6 16:31:33 2018 -0500

    Added 64-bit integer support to BLAS test drivers.
    
    Details:
    - Updated the build system and BLAS test drivers to use 64-bit integers
      when BLIS is configured for 64-bit integers in the BLAS layer. Also
      updated blastest/Makefile accordingly. Thanks to Dave Love for
      reporting the need for this feature.
    - Added a 'check' target to blastest/Makefile so that the user can see
      a summary of the tests.
    - Commented out the initial definition of INCLUDE_PATHS in common.mk,
      which was used pre-monolithic header, back when BLIS needed paths to
      *all* headers, rather than just a select few. This line is no longer
      needed since the value of INCLUDE_PATHS is overwritten by a later
      definition limited to only the header paths that are needed now.

commit d39fa1c04265869bdf8b6f453076359eec2f3c59
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 5 19:38:35 2018 -0500

    Adjusted CFLAGS used to compile bli_cntx_ref.c.
    
    Details:
    - Removed CKOPTFLAGS and CVECFLAGS from the set of CFLAGS used to
      compile bli_cntx_ref.c for each configuration. This is necessary
      because the file defines functions like bli_cntx_init_skx_ref(),
      which are called during BLIS's initialization of the global kernel
      structure, potentially being executed by an architecture that lacks
      the instruction set used to compile the kernels for, in this example,
      skx, which would lead to an illegal instruction error. Thanks to
      Dave Love for reporting this issue.
    - Further adjusted CFLAGS used when compiling code in the 'config'
      directory (e.g. bli_cntx_init_skx.c) as well as code in 'frame' so
      as to avoid the aforementioned issue.

commit 08b123084d35680beab379012f8f5a5a8b44a443
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 5 14:25:39 2018 -0500

    Added color-coding to 'make check' output.
    
    Details:
    - Added color coding to output of check-blistest.sh, check-blastest.sh
      scripts. Success messages are coded green and failures are coded red.
      This helps draw the eye toward those messages as the 'make checkblis',
      'make checkblis-fast', and 'make checkblas' targets are executed.
    - Changed top-level Makefile so that execution will not halt if
      'checkblis', 'checkblis-fast', or 'checkblas' targets fail, which
      means that the second of the two tests (BLIS and BLAS) run by
      'make check' will run even if the first test fails.

commit c9e4d7db7410b03c1ffe8c9727e9f1b2ba7fecfe
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Apr 4 17:13:15 2018 -0500

    CHANGELOG update (0.3.1)

commit 1f28d7c86e17730f05bd239c8e8d67e3e7510a4f (tag: 0.3.1)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Apr 4 17:13:15 2018 -0500

    Version file update (0.3.1)

commit e6cc9ee26bcf0450f1120d5d12985b04d9fb8516
Merge: 786d15c5 3c91c7ae
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Apr 4 16:08:18 2018 -0500

    Merge branch 'dev' of github.com:flame/blis into dev

commit 786d15c5ef09f1f647b126b63d57e76d5810c58e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Apr 4 16:06:47 2018 -0500

    Added skx, knl to x86_64 configuration family.
    
    Details:
    - Added 'skx' and 'knl' sub-configurations to the 'x86_64' configuration
      family in the config_registry file.
    - Added logic to configure that avoids committing certain sub-configs to
      the configuration/kernel registries if those sub-configs cannot be
      handled properly by the chosen compiler. (This was modeled after
      similar logic in TBLIS's configure; thanks to Devin Matthews for
      pointing this out.) First, the compiler and its version are inspected
      and, based on the results, certain configurations are added to a
      "blacklist". Then, as the configuration registries are being created,
      configurations and/or kernels that match items in the blacklist are
      skipped over and not committed to the registries. Under certain
      circumstances, omitting a blacklisted configuration will indirectly
      invalidate other configurations due to the loss of availability of
      the original blacklisted configuration's kernel set. This additional
      indirect blacklist is also accounted for.
    - Added output to the beginning of configure that echoes information
      about the chosen compiler as well as the configurations that are
      blacklisted and must be stripped from the registries.
    - Various other cleanups in configure, especially with respect to
      explicitly declaring local variables in functions.
    - Comment updates to config/zen/make_defs.mk regarding choice of -march
      flags based on compiler version.

commit 3c91c7aebafb446a2582267beb3b22c8bb475b3b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 2 12:40:25 2018 -0500

    Fixed 64b type mismatch warning in cblas_xerbla.c.
    
    Details:
    - Fixed a compiler warning concerning a type mismatch between the
      format specifier of the printf() call in cblas_xerbla.c and its
      corresponding (info) argument. The warning manifested when the CBLAS
      layer was enabled and the BLAS/CBLAS integer type size was set to 64
      (the default is 32). The warning was fixed by changing the specifier
      from %d to %jd and typecasting the argument to intmax_t. Thanks to
      Dave Love for reporting this issue and submitting the patch.
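    
    Illustration (a minimal sketch, not the code from this commit): the
    specifier change described above can be pictured as follows. The
    helper name sketch_format_info() and the message text are
    hypothetical; only the use of %jd together with a cast to intmax_t
    comes from the commit message.

    ```c
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical helper, not BLIS code: formats a 64-bit 'info' value
       portably by casting it to intmax_t and printing it with %jd, so the
       specifier matches the argument regardless of the integer width. */
    int sketch_format_info( char* buf, size_t len, int64_t info )
    {
        return snprintf( buf, len, "info = %jd", ( intmax_t )info );
    }
    ```

    Calling the helper with a value that does not fit in 32 bits (for
    example 5000000000) formats the value without a type mismatch warning.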

commit 71eaf449a812fe2bd640d21513ec83974b2edb45
Merge: 6a628184 ae9a5be5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 27 17:21:43 2018 -0500

    Merge branch 'dev'

commit ae9a5be56d6f9b87278d6032154d2dcf3fb7d54f
Author: dnp <devangiparikh@gmail.com>
Date:   Tue Mar 27 17:01:23 2018 -0500

    Fixed bug in skx sgemm microkernel

commit 3f02af0905b1e2e2e065862f8afe5e9a52f282b2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Mar 26 17:40:04 2018 -0500

    Row storage optimizations to zen dotxf kernels.
    
    Details:
    - Split the main loop bodies of zen's [sd]dotxf kernels into two cases:
      one to handle a column-stored matrix A and one to handle a row-stored
      matrix A. This allows vector instructions to be employed even if A is
      stored by rows (and A^T appears stored as columns). Both storage cases
      use a common edge case loop. Thanks to Devin Matthews for this idea
      and for prototyping the change needed for sdotxf kernel.

commit 679dcc331dd870ec680e135a3fb65ffa6e3a91c2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Mar 26 15:35:17 2018 -0500

    Make k_iter/k_left uint64_t in bulldozer fma ukrs.
    
    Details:
    - Changed the declaration of k_iter and k_left for d, c, z microkernels
      from dim_t to uint64_t. This is needed to ensure compatibility with
      the movq instruction used to load the value into registers. This
      change should have been made a long time ago, but for some reason
      only recently began showing up via Travis CI.

commit 6a628184f6938673440e4cdd4fed0208c51fd1f9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Mar 26 14:48:16 2018 -0500

    Fixed a memkind-related compile-time bug on knl.
    
    Details:
    - Fixed a compile-time error that occurred due to the fact that
      BLIS_ENABLE_MEMKIND, defined in bli_config.h, was not being defined
      soon enough to be used in bli_system.h where it is needed to determine
      whether hbwmalloc.h should be #included. bli_system.h is now included
      after bli_config.h (and bli_config_macro_defs.h). Thanks to Dave Love
      for reporting this issue.
    - Tweaked the language used by configure to echo the status of the
      --with[out]-memkind option.

commit e2192a8fd58ec3657434ddd407033e097edad8f4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 23 12:53:48 2018 -0500

    Removed vzeroupper intrinsics from zen kernels.
    
    Details:
    - Fixed a bug in the zen (also used by haswell) dotxf kernels whereby a
      vzeroupper instruction destroyed part of the intermediate result
      stored by the vdpps instructions that came right before. (The
      vzeroupper intrinsic was removed.)
    - Removed remaining vzeroupper intrinsics from other zen kernels.
      Previously, the vzeroupper instructions were included because BLIS is
      typically compiled with -mfpmath=sse. But it was brought to my
      attention that inserting these vzeroupper instructions is unnecessary
      for our purposes, since (a) -mfpmath=sse results in VEX-encoded scalar
      code rather than literal SSE instructions, and (b) compilers already
      (likely) insert vzeroupper instructions where necessary. Thanks to
      Devin Matthews for zeroing in on the dotxf bug.
    - Removed -malign-double from bulldozer make_defs.mk. This alignment
      was already happening by default since bulldozer is an x86_64 system.

commit 22289ad23cd10b81451ce82f60d84b5f97e7fd85
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Mar 22 18:21:30 2018 -0500

    Added build system support for libmemkind.
    
    Details:
    - Added support for libmemkind to configure. configure attempts to
      detect the presence of libmemkind by compiling a small program
      containing #include <hbwmalloc.h> and a call to hbw_malloc(). If
      successful, it is assumed that libmemkind is present and available.
      If present, use of libmemkind is enabled by default, and otherwise
      use is disabled by default. If libmemkind is present, the user may
      explicitly disable use of the library by running configure with the
      --without-memkind option. Furthermore, a configuration may disable
      libmemkind, perhaps conditional on some aspect of the build system,
      by including -DBLIS_DISABLE_MEMKIND in the configuration's CPPROCFLAGS
      make variable and setting the BLIS_ENABLE_MEMKIND makefile variable,
      set in config.mk, to 'no'. (The knl configuration makes use of this
      latter feature; see below.)
    - If enabled at configure-time, bli_system.h will #include <hbwmalloc.h>
      and bli_kernel_macro_defs.h will define BLIS_MALLOC_POOL and
      BLIS_FREE_POOL to use hbw_malloc() and hbw_free(), respectively.
    - Deprecated explicit use of BLIS_NO_HBWMALLOC in
      config/knl/bli_family.knl.h and replaced use of -DBLIS_NO_HBWMALLOC in
      config/knl/make_defs.mk with -DBLIS_DISABLE_MEMKIND, which overrides
      (#undefs) the definition of BLIS_ENABLE_MEMKIND in bli_system.h, if it
      would otherwise be defined. Also, set the BLIS_ENABLE_MEMKIND makefile
      variable to 'no'.
    - common.mk now adds libmemkind to LDFLAGS if libmemkind is enabled.

commit 7dc40eafdd9af3e8c4519a8d1b04d25830b4ca7a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Mar 21 18:39:16 2018 -0500

    Updates to top-level and test driver Makefiles.
    
    Details:
    - Added logic to common.mk that will choose a BLIS library against which
      to link (LIBBLIS_LINK). The default choice is the static (.a) library;
      the shared (.so) library is chosen only if the shared library build was
      enabled and the static one was disabled.
    - Updated the various test driver Makefiles to reference this common,
      pre-chosen library against which to link. (Previously, these drivers
      unconditionally linked against the static library and would have
      failed if the static library build was disabled at configure-time.)
    - Renamed many of the variables in common.mk and the top-level Makefile
      so that variables relating to the libblis.[a|so] files, including
      paths to those files, begin with "LIBBLIS".
    - Shuffled around some of the library definitions from the top-level
      Makefile to common.mk.
    - Renamed BLIS_ENABLE_DYNAMIC_BUILD to BLIS_ENABLE_SHARED_BUILD, and
      the @enable_dynamic@ anchor to @enable_shared@ in build/config.mk.in
      and in configure.
    - A few other cleanups in the top-level Makefile.

commit 97e1eeade3c51df1bae574a9bc1da34b05bf2bd3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Mar 21 15:47:11 2018 -0500

    Added input.operations.fast file for 'make check'.
    
    Details:
    - Added an 'input.operations.fast' file to testsuite directory to go
      along with the 'input.general.fast' file used by the 'make check'
      target in the top-level Makefile. This will allow the "fast" check
      to prune operations and/or parameter combinations from the test
      space in order to save time.
    - Currently, input.operations.fast prunes trmm3 and all transposition
      and conjugation parameters from the level-3 test space.
    - Reduced problem size tested in input.general.fast to 100 and disabled
      testing of 1m method.

commit c441caa95aabe69f54e2160eb67bf4ca76a66c34
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 20 17:56:02 2018 -0500

    README update.
    
    Details:
    - Minor updates to README.md.
    - Minor change to blastest/Makefile.

commit 6fe018eb4ac8c16f2edc916c24f5994848017b7f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 20 15:35:45 2018 -0500

    Added .gitkeep file to blastest/obj.
    
    Details:
    - Added an empty file named '.gitkeep' to blastest/obj/ so that git will
      track the otherwise empty directory. (This is already done for the BLIS
      testsuite in testsuite/obj.)

commit 0e6d000db9291342913dc5f8590a28c67bbcbc95
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 20 15:08:43 2018 -0500

    Updated .gitignore to ignore BLAS test out.* files.

commit 40c040a31d96fbadff11f761d0cad1ef03ef2cc5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 20 14:33:50 2018 -0500

    Fixes to .travis.yml.
    
    Details:
    - Invoke the full BLIS testsuite via 'make testblis' instead of the fast
      version via 'blistest-fast' (which was wrong anyway, since the correct
      fast target is 'testblis-fast').
    - Invoke the BLAS tests via 'make testblas' instead of 'blastest'.

commit 664ec4813d8b53121cce7a68bef47da656ece9cb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 20 13:54:58 2018 -0500

    Integrated f2c'ed netlib BLAS test suite.
    
    Details:
    - Created a new test suite that exercises only the BLAS compatibility
      found in BLIS. The test suite is a straightforward port of code
      obtained from netlib LAPACK, run through f2c and linked to a stripped-
      down version of libf2c that is compiled along with the test drivers
      (to prevent any obvious ABI issues). The new BLAS test suite can be
      run from within its new local directory, 'blastest' (through its local
      'make ; make run' targets) or from the top-level Makefile (via the
      'make testblas' target). Output files are created in whatever directory
      the test drivers are run, whether it be the 'blastest' directory, the
      top-level source distribution directory, or the out-of-tree directory
      in which 'configure' was run. Also, the results of the BLAS test suite
      can be checked via 'make checkblas', which summarizes the presence or
      absence of test failures in a single line printed to stdout.
    - Updated the 'test' target to run both 'testblis' and 'testblas'.
    - Added a new 'testblis-fast' target that runs the BLIS testsuite with
      smaller problem sizes, allowing it to finish more quickly.
    - Added a 'make check' target, which runs 'checkblis-fast' and
      'checkblas'.
    - Changed .travis.yml so that Travis CI runs 'testblis-fast' instead of
      'testblis' (before calling the check-blistest.sh script to check the
      result manually).
    - Renamed some targets in the top-level Makefile to be consistent between
      BLAS and BLIS.

commit fc53ad6c5b2e39238b1bbbf625cc0c638b9da4e1
Author: Nisanth M P <nisanth.padinharepatt@amd.com>
Date:   Mon Mar 19 12:49:26 2018 +0530

    Re-enabling the small matrix gemm optimization for target zen
    
    Change-Id: I13872784586984634d728cd99a00f71c3f904395

commit d12d34e167d7dc32732c0ed135f8065a55088106
Author: Nisanth M P <nisanth.padinharepatt@amd.com>
Date:   Mon Mar 19 11:34:32 2018 +0530

    Re-enabling Zen optimized cache block sizes for config target zen
    
    Change-Id: I8191421b876755b31590323c66156d4a814575f1

commit 40fa10396c0a3f9601cf49f6b6cd9922185c932e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Mar 19 18:19:43 2018 -0500

    Fixed a few obscure bugs in the BLAS API.
    
    Details:
    - Fixed a missing parameter in the definition of sdsdot_(). The 'sb'
      argument was missing. Strangely, the argument is omitted from dsdot_()
      in the BLAS API.
    - Fixed the missing 'c' or 'u' in the "?gerc" or "?geru" operation string
      passed to xerbla_() by the bla_ger_check() macro.
    - For bla_syrk_check() and bla_syr2k_check() macros, only allow
      conjugate-transpose (trans='c') as a valid argument for the real
      domain functions [sd]syrk_() and [sd]syr2k_(). (Previously, the
      argument was allowed even for the complex domain equivalents, which
      was inconsistent with the BLAS API.)

commit fe7d7f1e43e4c26249eed83d4188beee1ba96202
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Mar 18 19:43:06 2018 -0500

    Fixed cpp macro parameter "ch" typo in bla_ger.c.
    
    Details:
    - Previously, the BLAS routine-generating macro in bla_ger.c was
      incorrectly passing MKSTR(ch) into the _check() macro when it
      should have been passing in the char that was available, chxy.
      I've instead changed the name of the macro parameter from chxy
      to ch. A similar change was made to bla_ger.h for consistency.
      Thanks to Dave Love for helping track this down. (NOTE: This is
      actually the root cause of the bug that was first patched by
      increasing the length of the operation name strings passed into
      xerbla_(), as defined by the constant BLIS_MAX_BLAS_FUNC_STR_LENGTH,
      in 3d1a5a7. In theory, that change could be backed out now.)
    - Applied aforementioned chxy->ch change to bla_dot.[ch], as well as
      frame/compat/cblas/f77_sub/f77_dot_sub.[ch] (not because it needed
      to happen, but for naming consistency).
    - Reformatted function signatures/prototypes of CBLAS functions and
      function calls to BLAS in frame/compat/cblas/f77_sub/*.c.

commit cb7ed90752d1ddbac11368c4510641ca4f3a02eb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 16 13:05:56 2018 -0500

    Convert op names to uppercase before calling xerbla_().
    
    Details:
    - Defined a new function, bli_string_mkupper(), that calls toupper() on
      every non-NULL character in a string.
    - Call bli_string_mkupper() prior to calling xerbla_() in the level-2/-3
      BLAS _check() macros. This prevents the BLAS testsuite from complaining
      that the operation name (e.g. "dgemm") does not match the expected
      value (e.g. "DGEMM"). Thanks to Dave Love for reporting this issue.
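    
    Illustration (a minimal sketch under stated assumptions, not the
    actual BLIS source): an in-place uppercase conversion of the kind
    described above. The name sketch_string_mkupper() and its signature
    are hypothetical; the real bli_string_mkupper() may differ.

    ```c
    #include <ctype.h>

    /* Hypothetical stand-in for the routine described above: applies
       toupper() to every character up to (not including) the
       terminating '\0'. The cast to unsigned char avoids undefined
       behavior when plain char is signed and holds a negative value. */
    void sketch_string_mkupper( char* s )
    {
        for ( ; *s != '\0'; ++s )
            *s = ( char )toupper( ( unsigned char )*s );
    }
    ```

    Applied to a writable buffer holding "dgemm", the buffer would then
    hold "DGEMM", which is the form the BLAS test suite expects.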

commit 3d1a5a7c08fed3ba29f060fe1db2b0dc42dde223
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 16 12:24:07 2018 -0500

    Fixed printf() format overflow.
    
    Details:
    - Increased the length of operation name strings passed to xerbla_() in
      the level-2 and level-3 operation _check() functions, found in
      frame/compat/check. This avoids a format specifier overflow warning by
      gcc 7. Thanks to Dave Love for reporting this issue and suggesting the
      fix.

commit c73055f028684d998e03b2392093c393782bbfe7
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Mar 15 16:08:21 2018 -0500

    Return after non-zero info in BLAS checks.
    
    Details:
    - Previously, when calling the BLAS compatibility layer, discovering a
      parameter check failure would result in the proper setting of the
      info parameter (printed by xerbla_()), but would also come with an
      immediate abort() rather than a return. This was incorrect behavior
      for two overlapping reasons.
      (1) BLAS should return gracefully to the caller in the event of a
          bad set of parameters, not abort().
      (2) When BLIS was being tested via the BLAS testsuite, BLIS's
          xerbla_() would correctly get preempted/overridden by the
          xerbla_() in the BLAS testsuite, but execution would then
          erroneously continue on to the BLIS implementation with bad
          parameter values.
    - The previous issue was addressed by disabling the abort() in BLIS's
      xerbla_(), changing all of the BLAS _check() functions to cpp macros,
      and adding a return statement to the end of each _check() macro's
      "if ( info != 0 )" conditional.
      Thanks to Dave Love for reporting this issue.
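    
    Illustration (a minimal sketch, not the actual BLIS macros): because
    a cpp macro expands inside the calling function, a return statement
    in the macro returns from that function, which a helper function
    could not do. The names SKETCH_CHECK and sketch_scal() are
    hypothetical.

    ```c
    #include <stdio.h>

    /* Hypothetical check macro: reports the error and returns from the
       enclosing void function instead of calling abort(). */
    #define SKETCH_CHECK( opname, info ) \
        do { \
            if ( ( info ) != 0 ) \
            { \
                fprintf( stderr, "%s: parameter %d is invalid\n", \
                         ( opname ), ( info ) ); \
                return; \
            } \
        } while ( 0 )

    /* Hypothetical BLAS-like routine; *executed is set to 1 only if the
       parameter check passes and the "implementation" is reached. */
    void sketch_scal( int n, int* executed )
    {
        int info = 0;

        *executed = 0;
        if ( n < 0 ) info = 1;

        SKETCH_CHECK( "SKETCH_SCAL", info );

        /* Reached only when all parameters are valid. */
        *executed = 1;
    }
    ```

    With n = -1 the routine returns to the caller before reaching the
    implementation; with a valid n, execution continues as usual.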

commit c4f1d18b97a6a8c3ea0366aa759db597a664062a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Mar 14 19:10:09 2018 -0500

    Minor typo fix to printing arch in testsuite.
    
    Details:
    - Mistakenly was calling bli_cpuid_query_id() instead of
      bli_arch_query_id() in the recent addition to the testsuite output
      that prints the active sub-configuration. The former function is
      only used for multi-architecture builds, whereas the latter is the
      more general option that also works for single configuration
      (including 'configure auto') builds.

commit 8f2fabec800a720b3e94b33c0048cc8c4ead436d
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Mar 14 17:43:42 2018 -0500

    Make arm32 and arm64 families work. (#176)

commit fc6a1842518a0820c6708c285611346d5a1419da
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Mar 14 15:31:17 2018 -0500

    Print sub-configuration name in testsuite output.
    
    Details:
    - Added a line to the testsuite output that prints the name of the
      current/active sub-configuration. This is useful when linking the
      testsuite against multi-configuration builds because it confirms
      the sub-configuration that is actually being employed at runtime.
      Thanks to Devin Matthews for suggesting this feature.

commit 9943a899d64bf7ec4a24106f6f4c70629bbe1f6e
Merge: 290dd4a9 b1a15ae6
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Mar 14 13:27:44 2018 -0500

    Merge pull request #173 from devinamatthews/dev
    
    Fix Cortex-A9 and Cortex-A15 configs.

commit b1a15ae6ee0f46c9a95cf59f9555925e0e8e21ff
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Mar 14 13:26:44 2018 -0500

    Use BLIS_H_FLAT

commit 290dd4a9feee447e69b40ad108954af78e196f7e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Mar 14 13:15:37 2018 -0500

    Allow arbitrarily deep configuration families.
    
    Details:
    - Updated configure so that configuration families specified in the
      config_registry are no longer constrained as being only one level
      deep. For example, previously the x86_64 family could not be defined
      concisely in terms of, say, intel64 and amd64 families, and instead
      had to be defined as containing "haswell, sandybridge, penryn, zen,
      etc." In other words, families were constrained to only having
      singleton configurations as their members. That constraint is now
      lifted.
    - Redefined x86_64 family in config_registry in terms of intel64 and
      amd64.

commit 9cee78e006d56543ac02fc9c488905c0434e60ae
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Mar 14 13:09:48 2018 -0500

    Fix Cortex-A9 and Cortex-A15 configs.
    
    Tested with QEMU.

commit 1a3031740f7fcbbcc2c99d5c4cb50d0413407455
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 13 16:04:40 2018 -0500

    Updates to ARM hardware detection support.
    
    Details:
    - Updated/clarified the ARM preprocessor macro branch of bli_cpuid.c.
      Going forward, cortexa57 (64-bit), cortexa15, and cortexa9 (32-bit)
      sub-configurations are supported. However, the functions that detect
      features specific to a15 and a9 are identical, and since a15 is tested
      first, it will always be chosen for arm32 hardware (even if both
      sub-configurations were enabled at configure-time and the library is
      linked and run on an a9). Thus, more work needs to be done to
      distinguish these two.
    - Added cpp guard around x86_64 portions of bli_cpuid.c. Now, either
      the x86_64 or ARM code will be compiled (or neither, if neither
      environment is detected).
    - In bli_arch_query_id(), call bli_cpuid_query_id() when the
      BLIS_FAMILY_ARM64 or BLIS_FAMILY_ARM32 macros are defined.
    - Added arm64 and arm32 configuration families to config_registry.
    - Added a note to the arch_t typedef enum in bli_type_defs.h reminding
      the developer to update the string array in bli_arch.c whenever new
      enum values are added or existing values are reordered.

commit 1442d06886ebdc34d8f1cb620229ddc6062c2ce8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Mar 11 16:59:50 2018 -0500

    Fixed misnamed kernels in _cntx_init_cortexa57.c.
    
    Details:
    - Changed incorrect kernel function names in bli_cntx_init_cortexa57.c:
        bli_sgemm_cortexa57_asm_8x12 -> bli_sgemm_armv8a_asm_8x12
        bli_dgemm_cortexa57_asm_6x8  -> bli_dgemm_armv8a_asm_6x8
      Thanks to Jacob Gorm Hansen for reporting this issue.

commit 28bcea37dfcf0eb99a99da6f46de2a2830393d1d
Merge: b1ea3092 8b0475a8
Author: praveeng <praveen.g@amd.com>
Date:   Fri Mar 9 19:13:08 2018 +0530

    Merge master code till 06_mar_2018 to amd-staging
    
    Change-Id: I12267e5999c92417e3715fef4f36ac2131d00f1a

commit 48da9f5805f0a49f6ad181ae2bf57b4fde8e1b0a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Mar 7 12:54:06 2018 -0600

    Tweaked common.mk, Makefile, skx/knl make_defs.mk.
    
    Details:
    - Reorganized linker-related section of common.mk so that LDFLAGS set
      in a sub-configuration's make_defs.mk file will not be immediately
      (and erroneously) overridden by the default values.
    - Re-enabled redirected (to file) output of the testsuite when run from
      the top-level Makefile via 'make test'. (For some reason, it was
      commented-out for the non-verbose case.)
    - Removed old/unnecessary code from the make_defs.mk files of skx and
      knl sub-configurations.

commit 8b0475a87daa177916e2caac0e530c6a57fa07cf
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 6 06:39:44 2018 -0600

    Fixed typo in attempted fix in 1a8350f7.
    
    Details:
    - Mistakenly entered 148 as knl mc blocksize for double real when the
      value should have been 144. Thanks to Dave Love for reporting this.
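    
    Illustration (a minimal sketch, not the check used inside BLIS): with
    the mr of 24 mentioned in 1a8350f7, neither 160 nor 148 is a whole
    multiple of mr, whereas 144 is. The function name
    sketch_is_multiple_of() is hypothetical.

    ```c
    #include <stdbool.h>

    /* Hypothetical check, loosely modeled on the safeguard described in
       1a8350f7: a cache blocksize should be a whole multiple of its
       corresponding register blocksize. */
    bool sketch_is_multiple_of( long cache_bs, long reg_bs )
    {
        return reg_bs > 0 && cache_bs % reg_bs == 0;
    }
    ```

    Here 160 % 24 is 16 and 148 % 24 is 4, so both values fail the
    check, while 144 % 24 is 0 and passes.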

commit 8912e6886b97eabb4ce0c35a3609a0fd994d347b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Mar 5 18:00:45 2018 -0600

    Fixed missing flags during shared object build.
    
    Details:
    - Fixed a bug in common.mk that caused warning, position-independent
      code, miscellaneous, and general preprocessor flags to be omitted
      from the configuration family-specific variables that hold those
      values, as registered by the family's make_defs.mk file. This would
      most obviously manifest when targeting a configuration family such as
      'intel64' while simultaneously configuring for a shared object build,
      as the key '-fPIC' flag would be omitted at compile-time and prevent
      successful linking. Thanks to Dave Love for reporting this bug.
    - Other cleanups to common.mk for readability and clarity.

commit 1a8350f70557fc53ca0c2eadf2076710dd0d9bc9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Mar 5 13:32:00 2018 -0600

    Fixed cache blocksize bug in knl configuration.
    
    Details:
    - Changed the mc blocksize for double real execution in the knl sub-
      configuration from 160 to 148. The old value was not a multiple of
      mr (which is 24), and thus the safeguards in bli_gks_register_cntx()
      were tripping. Thanks to Dave Love for reporting this issue.
    - Switch knl sub-configuration to use default blocksizes for datatypes
      not supported by native kernels.
    - Fixed typos in bli_error.c that prevented certain error strings
      (which report maximum cache blocksizes not being multiples of their
      corresponding register blocksize) from properly initializing.

commit c09fffa827fe6241dc20193a1c404496664220de
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Mar 3 13:13:39 2018 -0600

    Added missing cntx_t* arg in knl packm kernels.
    
    Details:
    - Added the missing cntx_t* argument to the function signature of packm
      kernels in kernels/knl/1m/. Thanks to Dave Love for reporting this
      issue.

commit b1ea30925dff751eced23dfa94ff578a20ea0b94
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 23 17:42:48 2018 -0600

    CHANGELOG update (0.3.0)
    
    Change-Id: Id038b00a62de51c9818ad249651ec5dc662f4415

commit 1ef9360b1fd0209fbeb5766f7a35402fbd080fcb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Mar 1 14:36:39 2018 -0600

    Enable non-unit vector stride tests by default.
    
    Details:
    - Change "vector storage schemes to test" parameter in testsuite's
      input.general file to "cj". This means that both unit stride column
      vectors and non-unit stride column vectors will be tested in
      operations with vector operands (e.g. level-1v, level-1f, level-2).
    - Very minor comment (typo) changes to input.operations.

commit 8c4e55a1a1ead9a5e970200fee027ffd2c7e8454
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 28 17:01:47 2018 -0600

    Added individual operation overrides in testsuite.
    
    Details:
    - Updated the testsuite driver so that setting one or more individual
      operation test switches to "2" in input.operations will enable ONLY
      those operations and disable all others, regardless of the values of
      the section overrides and other operation switches. This makes it
      very easy to quickly test only one or two operations, and equally
      easy to revert back to the previous combination of operation tests.
    - Added more comments to input.operations describing the use of
      individual "enable only" overrides.

commit 34862aed89e5d5a8f35aeecd49f3052ada1f337b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 28 15:30:14 2018 -0600

    Use zen kernels in haswell sub-configuration.
    
    Details:
    - Register use of level-1v zen intrinsic kernels for amaxv, axpyv, dotv,
      dotxv, and scalv, as well as level-1f zen intrinsic kernels for axpyf
      and dotxf. This works because these kernels simply target AVX/AVX2,
      and therefore work without modification on haswell hardware.
    - Switch to use of zen microkernels in bli_cntx_init_haswell.c. The zen
      kernels are essentially identical to those used by haswell, except that
      now zen kernels are a bit more up-to-date. In the future, I may
      continue to maintain duplicates, or I may keep the kernels named after
      one architecture (zen or haswell) but used by both sub-configurations.
    - In config_registry, enable use of both haswell and zen kernels for the
      haswell sub-configuration. This is necessary in order to make zen
      kernels visible when registering kernels in bli_cntx_init_haswell.c.
    - Enable use of assembly-based complex gemm microkernels for zen,
      bli_cgemm_zen_asm_3x8() and bli_zgemm_zen_asm_3x4(), in
      bli_cntx_init_zen.c. This was actually intended for 1681333.

commit 709f8361ebc90b96b02ebe5c5ffb6fc3b1b25e58 (tag: 0.3.0)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 23 17:42:48 2018 -0600

    Version file update (0.3.0)

commit d9079655c9cbb903c6761d79194a21b7c0a322bc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 23 17:42:48 2018 -0600

    CHANGELOG update (0.3.0)

commit 3defc7265c12cf85e9de2d7a1f243c5e090a6f9d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 23 17:38:19 2018 -0600

    Applied 34b72a3 to non-active/unused microkernels.
    
    Details:
    - Applied the read-beyond-bounds bugfix in 34b72a3 to other haswell and
      zen kernels (i.e., other microtile shapes) which are not used by default.
      This was done mostly in case someone decided to pick up these kernels
      and start using them, not because it affects BLIS's behavior
      out-of-the-box.

commit 34b72a351745aa0d47bb0b74ebcd0f0a616d613d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 23 16:33:32 2018 -0600

    Fixed obscure read-beyond-bounds bug in sgemm ukrs.
    
    Details:
    - Fixed an obscure bug in the bli_sgemm_haswell_asm_6x16 and
      bli_sgemm_zen_asm_6x16 microkernels when the input/output matrix C
      is stored with general stride (i.e., both rs and cs are non-unit). The
      bug was rooted in the way those microkernels read from matrix C--
      namely, they used vmovlps/vmovhps instead of movss. By loading two
      floats at a time, even if one of them was treated as junk, the
      assembly code could be written in a more concise manner. However,
      under certain conditions--if m % mr == 0 and n % nr == 0 and the
      underlying matrix is not an internal "view" into a larger matrix--
      this could result in the very last vmovhps of the last (bottom-right)
      microkernel invocation reading beyond valid memory. Specifically, the
      low 32 bits read would always be valid, but the high 32 bits could
      reside beyond the bounds of the array in which the output C matrix is
      contained. To remedy this situation, we now selectively use movss to
      load any element that could be the last element in the matrix.
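
    The bounds argument above can be checked with plain index arithmetic.
    The sketch below is not the assembly fix itself; it only models which
    element offsets a two-float load and a one-float load would touch. All
    function names are hypothetical, and the matrix is assumed to occupy
    the smallest array that fits its strides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Element offset of C(i,j) for row stride rs and column stride cs. */
size_t demo_c_offset(size_t i, size_t j, size_t rs, size_t cs)
{
    return i * rs + j * cs;
}

/* Smallest array length (in floats) that holds an m x n matrix with the
   given strides. Assumes m >= 1 and n >= 1. */
size_t demo_min_len(size_t m, size_t n, size_t rs, size_t cs)
{
    assert(m >= 1 && n >= 1);
    return (m - 1) * rs + (n - 1) * cs + 1;
}

/* A 64-bit load (vmovlps/vmovhps style) at C(i,j) touches two floats:
   the wanted one at `off` and a junk one at `off + 1`. */
bool demo_pair_load_in_bounds(size_t i, size_t j, size_t rs, size_t cs,
                              size_t len)
{
    return demo_c_offset(i, j, rs, cs) + 1 < len;
}

/* A 32-bit load (movss style) touches only the wanted float. */
bool demo_scalar_load_in_bounds(size_t i, size_t j, size_t rs, size_t cs,
                                size_t len)
{
    return demo_c_offset(i, j, rs, cs) < len;
}
```

    For a 6x16 matrix with rs = 32 and cs = 2, only the bottom-right
    element fails the pair-load check, which is why a scalar load is
    needed only for elements that could be last in the matrix.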

commit 5112e1859e7f8888f5555eb7bc02bd9fab9b4442 (origin/rt)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 23 14:31:26 2018 -0600

    Added missing 'restrict' to some kernels' cntx_t*.
    
    Details:
    - Added missing 'restrict' keyword to cntx_t* argument of function
      signatures corresponding to level-1v, level-1f, and level-1m kernels.
      This affected bli_l1v_ker_prot.h, bli_l1f_ker_prot.h, and
      bli_l1m_ker_prot.h. (The 'restrict' was already being used to
      qualify cntx_t* arguments for kernels defined in bli_l3_ker_prot.h.)
    - Added comments to bli_l1v_ker.h, bli_l1f_ker.h, bli_l1m_ker.h, and
      bli_l3_ukr.h that help explain how those headers function to produce
      kernel prototypes using the prototype macros defined in the files
      mentioned above.
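
    As an illustration of the qualifier discussed above, here is a minimal
    level-1v-style kernel whose pointer arguments, including the context,
    are all restrict-qualified. The type demo_cntx_t and the function
    demo_daxpyv() are stand-ins invented for this sketch; they are not the
    prototypes generated by the BLIS headers.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a context object; a real context carries far more state. */
typedef struct { int unused; } demo_cntx_t;

/* y := y + alpha * x. Each pointer is restrict-qualified, which promises
   the compiler that the pointed-to objects do not alias one another. */
void demo_daxpyv(size_t n,
                 const double* restrict alpha,
                 const double* restrict x,
                 double* restrict y,
                 const demo_cntx_t* restrict cntx)
{
    assert(alpha != NULL && x != NULL && y != NULL);
    (void)cntx; /* the context is not consulted in this sketch */

    for (size_t i = 0; i < n; ++i)
        y[i] += (*alpha) * x[i];
}
```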

commit 1fa8af95d807168e0849adb668492601e7009be0
Merge: c084b03b 16813335
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 21 17:54:02 2018 -0600

    Merge branch 'rt'

commit c084b03b31d84427a120e391963db5419f1911ee
Merge: 5d03b6e6 fa74af4e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 21 17:52:17 2018 -0600

    Merge branch 'rt'

commit 16813335bdb5978bc9a26cd00a32bd5a130130c4
Merge: fa74af4e 5a7005dd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 21 17:43:32 2018 -0600

    Merge branch 'amd' into rt
    
    Details:
    - Merged contributions made by AMD via 'amd' branch (see summary below).
      Special thanks to AMD for their contributions to date, especially with
      regard to intrinsic- and assembly-based kernels.
    - Added column storage output cases to microkernels in
      bli_gemm_zen_asm_d6x8.c and bli_gemmtrsm_l_zen_asm_d6x8.c. Even with
      the extra cost of transposing the microtile in registers, this is
      much faster than using the general storage case when the underlying
      matrix is column-stored.
    - Added s and d assembly-based zen gemmtrsm_u microkernel (including
      column storage optimization mentioned above).
    - Updated zen sub-configuration to reflect presence of new native
      kernels.
    - Temporarily reverted zen sub-configuration's level-3 cache blocksizes
      to smaller haswell values.
    - Temporarily disabled small matrix handling for zen configuration
      family in config/zen/bli_family_zen.h.
    - Updated zen CFLAGS according to changes in 1e4365b.
    - Updated haswell microkernels such that:
      - only one vzeroupper instruction is called prior to returning
      - movapd/movupd are used in lieu of movaps/movups for double-real
        microkernels. (Note that single-real microkernels still use
        movaps/movups.)
    - Added kernel prototypes to kernels/zen/bli_kernels_zen.h, which is
      now included via frame/include/bli_arch_config.h.
    - Minor updates to bli_amaxv_ref.c (and to inlined "test" implementation
      in testsuite/src/test_amaxv.c).
    - Added early return for alpha == 0 in bli_dotxv_ref.c.
    - Integrated changes from f07b176, including a fix for undefined
      behavior when executing the 1m method under certain conditions.
    - Updated config_registry; no longer need haswell kernels for zen
      sub-configuration.
    - Tweaked marginal and pass thresholds for dotxf.
    - Reformatted level-1v, -1f, and -3 amd kernels and inserted additional
      comments.
    - Updated LICENSE file to explicitly mention that parts are copyright
      UT-Austin and AMD.
    - Added AMD copyright to header templates in build/templates.
    
    Summary of previous changes from 'amd' branch.
    - Added s and d assembly-based zen gemm microkernels (d6x8 and d8x6) and
      s and d assembly-based zen gemmtrsm_l microkernels (d6x8).
    - Added s and d intrinsics-based zen kernels for amaxv, axpyv, dotv, dotxv,
      and scalv, with extra-unrolling variants for axpyv and scalv.
    - Added a small matrix handler to bli_gemm_front(), with the handler
      implemented in kernels/zen/3/bli_gemm_small_matrix.c.
    - Added additional logic to sumsqv that first attempts to compute the
      sum of the squares via dotv(). If there is a floating-point exception
      (FE_OVERFLOW), then the previous (numerically conservative) code is
      used; otherwise, the result of dotv() is square-rooted and stored as
      the result. This new implementation is only enabled when FE_OVERFLOW
      is #defined. If the macro is not #defined, then the previous
      implementation is used.
    - Added axpyv and dotv standalone test drivers to test directory.
    - Added zen support to old cpuid_x86.c driver in build/auto-detect/old.
    - Added thread-local and __attribute__-related macros to bli_macro_defs.h.

commit 5d03b6e6e19d5a07f0cccf1a158f02fbd62dfd99
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Feb 19 11:31:30 2018 -0600

    Fix asm macro include line for KNL. Fixes #167.

commit f07b176c84dc9ca38fb0d68805c28b69287c938a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 15 18:36:54 2018 -0600

    Fixed an obscure bug in the 1m implementation.
    
    Details:
    - Fixed a bug in the way the bli_gemm1m_cntx_ref() function (defined in
      ref_kernels/bli_cntx_ref.c) initializes its context for 1m execution.
      Previously, the function probed the context that was in the process of
      being updated for use with 1m--this context being previously
      initialized/copied from a native context--for its storage preference
      to determine which "variant" (row- or column-oriented) of 1m would be
      needed. However, the _cntx_ref() function was not updating the method
      field of the context until AFTER this query, and the conditional which
      depended on it, had taken place, meaning the storage preference query
      function would mistakenly think the context was for native execution,
      since the context's method field would still be set to BLIS_NAT. This
      would lead it to incorrectly grab the storage preference of the complex
      domain microkernel rather than the corresponding real domain
      microkernel, which could cause the storage preference predicate to
      evaluate to the wrong value, which would lead to the _cntx_ref()
      function choosing the wrong variant. This could lead to undefined
      behavior at runtime. The method is now explicitly set within the
      context prior to calling the storage preference query function.
    - Updated comments in frame/ind/oapi/bli_l3_3m4m1m_oapi.c.
    - Fixed a typo in the commented-out CFLAGS in config/zen/make_defs.mk,
      which are appropriate for gcc 6.x and newer. (Mistakenly used
      -march=bdver4 instead of -march=znver1.)
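
    The ordering problem in the 1m fix above can be reduced to a few
    lines. This is a simplified model, not the code in bli_cntx_ref.c:
    the struct, its fields, and every demo_* name are assumptions made
    for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { DEMO_NAT, DEMO_1M } demo_method_t;

typedef struct
{
    demo_method_t method;
    bool real_ukr_prefers_rows; /* preference of the real-domain ukernel */
    bool cplx_ukr_prefers_rows; /* preference of the complex-domain ukernel */
    bool use_row_variant;       /* which 1m variant was selected */
} demo_ind_cntx_t;

/* The query consults the method field: native execution reports the
   complex microkernel's preference, 1m reports the real microkernel's. */
bool demo_prefers_rows(const demo_ind_cntx_t* cntx)
{
    return cntx->method == DEMO_1M ? cntx->real_ukr_prefers_rows
                                   : cntx->cplx_ukr_prefers_rows;
}

/* Fixed ordering: set the method first, then query. */
void demo_init_1m(demo_ind_cntx_t* cntx)
{
    assert(cntx != NULL);
    cntx->method = DEMO_1M;
    cntx->use_row_variant = demo_prefers_rows(cntx);
}

/* Buggy ordering, kept for comparison: the query runs while the method
   field still says native, so the wrong preference is read. */
void demo_init_1m_buggy(demo_ind_cntx_t* cntx)
{
    assert(cntx != NULL);
    cntx->use_row_variant = demo_prefers_rows(cntx);
    cntx->method = DEMO_1M;
}
```

    When the real and complex microkernels disagree on storage
    preference, the two functions select different variants.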

commit 1f94bb7b96eb2b67257e6c4df89e29c73e9ab386
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jan 19 12:46:53 2018 -0600

    Document how to enable zen-specific instructions.
    
    Details:
    - Added as a comment in config/zen/make_defs.mk the list of compiler flags
      that could be added to manually enable the instructions provided by the
      Zen microarchitecture that are not already implied by -march=bdver4.
      This information, along with the previous commit's flags to selectively
      disable Bulldozer instructions no longer present in Zen, was gathered
      from [1]. I hesitate to enable use of these instructions since I don't
      have any Zen hardware to test on yet.
      [1] https://wiki.gentoo.org/wiki/Ryzen

commit 1e4365b21bafa02bd108c5ac4705a25671fb9441
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jan 18 12:03:51 2018 -0600

    Augment zen CFLAGS to prevent illegal instruction.
    
    Details:
    - Added various compiler flags (-mno-fma4 -mno-tbm -mno-xop -mno-lwp) so
      that compiling with -march=bdver4 on zen-based architectures does not
      result in an illegal instruction error at runtime. Note: This fix is
      only needed for gcc 5.4; gcc 6.3 or later supports the use of
      -march=znver1, which can be used in lieu of the augmented set of flags
      based on bdver4. Thanks to Nisanth Padinharepatt for reporting this
      error.

commit fa74af4e1fa7385ac3f3089fe1ea7bb88c906029
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jan 9 13:43:15 2018 -0600

    Minor labeling update for './configure -c' output.
    
    Details:
    - Print the name of the configuration in the output of the
      kernel-to-config map (and chosen pairs list) as a subtle way to remind
      the user that these only apply to the targeted configuration (whereas
      the config list and kernel list are printed without regard to which
      configuration was actually targeted).

commit 5cdea756c7391e2c6cbfb38436ef9a205f860237
Merge: 9d8858b5 1e7a4896
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Jan 7 19:45:20 2018 -0600

    Merge branch 'rt'

commit 9d8858b5cff4a4b078b87872847a5710073fff0a
Merge: 0b3ca3cf f7df64da
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Sun Jan 7 10:03:25 2018 -0600

    Merge pull request #164 from devinamatthews/master
    
    Don't use memkind for skx configuration.

commit f7df64daf6bbe6431effada6e13d8d1fab5aa221
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Sun Jan 7 09:37:25 2018 -0600

    Don't use memkind for skx configuration. Fixes #163.

commit 1e7a4896e0cbe73c4685fa956278e3f28273cdf9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jan 5 12:33:48 2018 -0600

    Minor error handling in update-version-file.sh.
    
    Details:
    - Added explicit handling of situations when 'git describe --tags'
      returns an error. This command is used by update-version-file.sh
      when deciding whether or not to update the version file prior to
      configuration.
    - Removed bli_packm.c and bli_unpackm.c, as they contained no source
      code.

commit 0b3ca3cfb682715a3686fd93ebb10d4a695d1162
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jan 4 20:51:35 2018 -0600

    Intelligently select compiler for auto-detection.
    
    Details:
    - Rewrote code that selects the compiler for the purposes of compiling
      the auto-detection executable. CC (if specified) is tried first. Then
      gcc. Then clang. The absolute fallback is cc. The previous code was
      sort of broken, and seemed to unintentionally always use gcc.
    - Moved various configuration-agnostic flags from config/*/make_defs.mk
      files to common.mk. The new mechanism appends the configuration-
      agnostic flags to the various compiler flag variables initialized in
      make_defs.mk. Flags specific to the sub-configuration are still set
      in make_defs.mk.
    - Added -Wno-tautological-compare to CMISCFLAGS when clang is in use.
      Also added the flag to the compiler instantiation during configure-
      time hardware detection (when clang is selected).
    - Added some missing (but mostly-optional) quotes to configure script.

commit 5a7005dd44ed3174abbe360981e367fd41c99b4b
Merge: 7be88705 3bc99a96
Author: Nisanth M P <nisanth.padinharepatt@amd.com>
Date:   Wed Jan 3 12:05:12 2018 +0530

    Merge changes in AMD beta release 0.95 into amd branch

commit 0b9c5127e91508c115228ca604ee2dac8de8f477
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Dec 23 15:53:44 2017 -0600

    Enabled C99, added stdint.h to auto-detect build.
    
    Details:
    - Added "-std=c99" to compiler arguments when building auto-detection
      driver in configure script.
    - Added #include <stdint.h> to all three source files needed by auto-
      detection program.

commit 0ce5e19c318e04909d3e664d69accb3a0fc6b988
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Dec 23 15:32:03 2017 -0600

    Reimplemented configure-time hardware detection.
    
    Details:
    - Reimplemented the hardware detection functionality invoked when running
      "./configure auto". Previously, a standalone script in build/auto-detect
      that used CPUID was used. However, the script attempted to enumerate all
      models for each microarchitecture supported. The new approach recycles
      the same code used for runtime hardware detection introduced in 2c51356.
      This has two immediate benefits. First, it reduces and consolidates the
      code required to detect microarchitectures via the CPUID instruction.
      Second, it provides an indirect way of testing at configure-time the
      code that is used to detect hardware at runtime. This code is (a) only
      activated when targeting a configuration family (such as intel64 or
      amd64) at configure-time and (b) somewhat difficult to test in
      practice, since it relies on having access to older microarchitectures.
    - The above change required placing conditional cpp macro blocks in
      bli_arch.c and bli_cpuid.c which either #include "blis.h" or #include
      a bare-bones set of headers that does not rely on the presence of a
      bli_config.h header. This is needed because bli_config.h has not been
      created yet when configure-time auto-detection takes place.
    - Defined a new function in bli_arch.c, bli_arch_string(), which takes
      an arch_t id and returns a pointer to a string that contains the
      lowercase name of the corresponding microarchitecture. This function
      is used by the auto-detection script to printf() the name of the
      sub-configuration corresponding to the detected hardware.
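
    A function in the spirit of bli_arch_string() can be sketched as a
    table lookup. The enum values, the table, and the demo_* names below
    are hypothetical and cover only three architectures; the real arch_t
    and its string table are defined in BLIS.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of architecture ids. The final enumerator is not an
   architecture; it counts the ids that precede it. */
typedef enum
{
    DEMO_ARCH_HASWELL,
    DEMO_ARCH_ZEN,
    DEMO_ARCH_GENERIC,
    DEMO_NUM_ARCHS
} demo_arch_t;

/* Lowercase names, indexed by id. */
static const char* const demo_arch_names[DEMO_NUM_ARCHS] =
{
    "haswell",
    "zen",
    "generic"
};

/* Return a pointer to the lowercase name for the given id. */
const char* demo_arch_string(demo_arch_t id)
{
    assert((unsigned)id < (unsigned)DEMO_NUM_ARCHS);
    return demo_arch_names[id];
}
```

    A detection driver could then print the result with
    printf("%s\n", demo_arch_string(id)).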

commit 9804adfd405056ec332bb8e13d68c7b52bd3a6c1 (origin/selfinit)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 21 19:22:57 2017 -0600

    Added option to disable pack buffer memory pools.
    
    Details:
    - Added a new configure option, --[en|dis]able-packbuf-pools, which will
      enable or disable the use of internal memory pools for managing buffers
      used for packing. When disabled, the function specified by the cpp
      macro BLIS_MALLOC_POOL is called whenever a packing buffer is needed
      (and BLIS_FREE_POOL is called when the buffer is ready to be released,
      usually at the end of a loop). When enabled, which was the status quo
      prior to this commit, a memory pool data structure is created and
      managed to provide threads with packing buffers. The memory pool
      minimizes calls to bli_malloc_pool() (i.e., the wrapper that calls
      BLIS_MALLOC_POOL), but does so through a somewhat more complex
      mechanism that may incur additional overhead in some (but not all)
      situations. The new option defaults to --enable-packbuf-pools.
    - Removed the reinitialization of the memory pools from the level-3
      front-ends and replaced it with automatic reinitialization within the
      pool API's implementation. This required an extra argument to
      bli_pool_checkout_block() in the form of a requested size, but hides
      the complexity entirely from BLIS. And since bli_pool_checkout_block()
      is only ever called within a critical section, this change fixes a
      potential race condition in which threads using contexts with different
      cache blocksizes--most likely a heterogeneous environment--can check
      out pool blocks that are too small for the submatrices they wish to
      pack. Thanks to Nisanth Padinharepatt for reporting this potential
      issue.
    - Removed several functions in light of the relocation of pool reinit,
      including bli_membrk_reinit_pools(), bli_memsys_reinit(),
      bli_pool_reinit_if(), and bli_check_requested_block_size_for_pool().
    - Updated the testsuite to print whether the memory pools are enabled or
      disabled.
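
    The idea of passing a requested size to the checkout function, so
    that resizing happens inside the pool rather than in the caller, can
    be sketched as follows. This is a deliberately tiny model with a
    single cached block; demo_pool_t and the demo_pool_* functions are
    assumptions and do not reproduce bli_pool_checkout_block(). The caller
    is assumed to hold whatever lock protects the pool.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct
{
    void*  block;       /* single cached block, for brevity */
    size_t block_size;  /* size of the cached block in bytes */
    int    checked_out; /* nonzero while a caller holds the block */
} demo_pool_t;

/* Check out a block of at least req_size bytes. If the cached block is
   missing or too small, it is replaced here, inside the pool API.
   Returns NULL if allocation fails. */
void* demo_pool_checkout(demo_pool_t* pool, size_t req_size)
{
    assert(pool != NULL && !pool->checked_out);

    if (pool->block == NULL || pool->block_size < req_size)
    {
        void* p = malloc(req_size);
        if (p == NULL) return NULL;
        free(pool->block);
        pool->block      = p;
        pool->block_size = req_size;
    }
    pool->checked_out = 1;
    return pool->block;
}

/* Return the block to the pool; it stays cached for the next checkout. */
void demo_pool_checkin(demo_pool_t* pool)
{
    assert(pool != NULL && pool->checked_out);
    pool->checked_out = 0;
}

/* Release the cached block. */
void demo_pool_finalize(demo_pool_t* pool)
{
    assert(pool != NULL && !pool->checked_out);
    free(pool->block);
    pool->block      = NULL;
    pool->block_size = 0;
}
```

    Because the size check and the reallocation both happen inside the
    checkout call, a caller with larger blocksizes cannot be handed a
    block sized for some other caller's smaller request.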

commit 107801aaae180c00022f1b990bc59038c14949d2
Merge: d9c05745 0084531d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 18 16:29:28 2017 -0600

    Merge branch 'master' into selfinit

commit 0084531d3eea730a319ecd7018428148c81bbba7
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Dec 17 18:58:25 2017 -0600

    Updated flatten-headers.py for python3.
    
    Details:
    - Modified flatten-headers.py to work with python 3.x. This mostly
      amounted to removing print statements (which I replaced with calls
      to my_print(), a wrapper to sys.stdout.write()). Thanks to Stefan
      Husmann for pointing out the script's incompatibility with python 3.
    - Other minor changes/cleanups.

commit 90b11b79c302f208791bdfb1ed754873103c7ce5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Dec 17 17:34:32 2017 -0600

    Modest performance boost to flatten-headers.py.
    
    Details:
    - Updated flatten-headers.py to pre-compile the main regular expression
      used to isolate #include directives and the header filenames they
      reference. The compiled regex object is then used over and over on
      each header file in the tree of referenced headers. This appears to
      have provided a 1.7-2x performance increase in the best case.
    - Other minor tweaks, such as renaming the main recursive function from
      replace_pass() to flatten_header().
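
    The compile-once, match-many pattern described above is not specific
    to python. As a rough analogue (not a translation of
    flatten-headers.py), the sketch below does the same with POSIX
    regcomp()/regexec() in C. The pattern, which only recognizes quoted
    #include directives, and the demo_* function names are assumptions.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Compile the #include-matching pattern once. Returns 0 on success. */
int demo_include_regex_init(regex_t* re)
{
    return regcomp(re,
                   "^[[:space:]]*#include[[:space:]]+\"([^\"]+)\"",
                   REG_EXTENDED);
}

/* Reuse the compiled pattern on one line of text. On a match, copy the
   header name into out (capacity cap, always NUL-terminated, truncated if
   necessary) and return 1; otherwise return 0. */
int demo_match_include(const regex_t* re, const char* line,
                       char* out, size_t cap)
{
    regmatch_t m[2];

    assert(re != NULL && line != NULL && out != NULL && cap > 0);

    if (regexec(re, line, 2, m, 0) != 0) return 0;

    size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
    if (len >= cap) len = cap - 1;
    memcpy(out, line + m[1].rm_so, len);
    out[len] = '\0';
    return 1;
}
```

    The caller runs demo_include_regex_init() once, calls
    demo_match_include() for every line of every header, and releases the
    pattern with regfree() at the end.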

commit 99dee87f30b4d437fa6b5e4ba862526d07b9f08b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Dec 17 16:47:27 2017 -0600

    Reimplemented flatten-headers.sh in python.
    
    Details:
    - Added flatten-headers.py, a python implementation of the bash script
      flatten-headers.sh. The new script appears to be 25-100x faster,
      depending on the operating system, filesystem, etc. The python script
      abides by the same command line interface as its predecessor and
      targets python 2.7 or later. (Thanks to Devin Matthews for suggesting
      that I look into a python replacement for higher performance.)
    - Activated use of flatten-headers.py in common.mk via the FLATTEN_H
      variable.
    - Made minor tweaks to flatten-headers.sh such as spelling corrections
      in comments.

commit d9c0574599c3f97c0f9b6c334a077bab9452e1f4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 14 17:13:42 2017 -0600

    Allow travis failures of OS X builds that run testsuite.
    
    Details:
    - Added an allowance for OS X builds that run the testsuite to fail.
      There seems to be an issue with 1m when running in Travis CI under
      OS X and clang, but only in double-precision. Haven't been able to
      reproduce the error on my own, and thus, I can't debug it. (Hopefully
      it is simply a version-specific compiler bug.)

commit 86cd23b7379b00a42b4ecc04fa668f1e3f9b54ee
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 14 15:47:41 2017 -0600

    Fixed testsuite Makefile brokenness from 9091a207.
    
    Details:
    - Fixed a makefile error encountered when building the testsuite directly
      in its directory (as opposed to indirectly via 'make test'). The fix
      involves introducing a new variable, BUILD_PATH, alongside the existing
      DIST_PATH variable. By default, BUILD_PATH is set to the current
      directory, and is overridden by other Makefiles used by, for example,
      the testsuite and standalone test drivers in testsuite or test,
      respectively.
    - Some files/directories in common.mk were redefined in terms of
      BUILD_PATH, such as the locations of the config.mk file and the intermediate
      include directory.

commit 6a3a8924c04d25507fc4aa593df30c56c7dc12f7
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 14 13:20:02 2017 -0600

    Temporarily show Makefile's testsuite output.
    
    Details:
    - Disabled redirection of testsuite output for 'test' target. This is
      part of an attempt to debug a segmentation fault on OS X via Travis.

commit 9a01080dd426915bed18229f70401bfa639dc283
Merge: 83316485 a32e8a47
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 14 11:27:19 2017 -0600

    Merge branch 'master' into selfinit

commit a32e8a47c022b6071302b2956af5728976c83ca9 (origin/travis)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Dec 13 16:31:36 2017 -0600

    Added an exclusion to .travis.yml.
    
    Details:
    - Added exclusion for out-of-tree builds on OS X (clang).

commit b9f7d987df548965c86e16e0ba94d5cad0d9b399
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Dec 13 16:22:09 2017 -0600

    Cleaned up after previous travis oot debugging.
    
    Details:
    - Removed debugging output from common.mk related to Travis CI
      out-of-tree builds.
    - Other minor cleanups to common.mk.

commit 9091a207aa8c49e279676ea02be533480b3b0d5a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Dec 13 16:12:34 2017 -0600

    Attempted fix to travis oot build failure.
    
    Details:
    - Found the likely cause of the Travis CI out-of-tree build failures:
      config.mk was being read from DIST_PATH, rather than the current
      directory.

commit c01c71c33e236e6c91f5ddd3ec1e3faec89368c1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Dec 13 15:58:50 2017 -0600

    Added debugging output to Makefile.
    
    Details:
    - Added $(info ...) statements in key locations in an attempt to reveal
      why Travis CI doesn't like building BLIS out-of-tree.

commit 784289d69dd6b3692444d3b3e290f6a014465b72
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Dec 13 15:31:27 2017 -0600

    Updated SHELL in common.mk from /bin/bash to bash.

commit d9bb1d1d4ebc89ea75d9d927d09882162a914f77
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Dec 13 15:27:54 2017 -0600

    Defined SHELL in common.mk so "echo -n" works.
    
    Details:
    - Defined the SHELL variable in common.mk as "/bin/bash" so that the
      -n option can be used with echo in the Makefile rule for flattening
      blis.h. Thanks to Devin Matthews for suggesting this fix.

commit 9289a08667df2044f3a37af54d893efe2b56d555
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Dec 13 15:14:27 2017 -0600

    Attempt 3 on .travis.yml.

commit 720bfcf0ef54fdc41df0dcaa94503edb0d5c8972
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Dec 13 14:52:28 2017 -0600

    More fixes to .travis.yml.
    
    Details:
    - Fixed a mistake (hopefully) in d0c4dd0 that resulted in many more
      osx/clang sub-tests than intended.
    - Shortened the variable names in an effort to make them more readable
      via the Travis CI web interface.

commit 8717c9c97fe9b1ecd3b3192049a73976f8390ca7
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Dec 13 14:36:37 2017 -0600

    Added 'pwd' commands to .travis.yml for debugging.
    
    Details:
    - Added 'pwd' commands to the script portion of the .travis.yml file in
      an attempt to uncover the problem with the recent out-of-tree build
      testing changes made in d0c4dd0.

commit 83316485ce10f6fcafe92a1c146282de0dd8068a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Dec 13 14:14:50 2017 -0600

    Simplified/fixed self-initialization.
    
    Details:
    - Fixed a race condition in self-initialization whereby the bli_is_init
      static variable could be erroneously read as TRUE by thread 1 while
      thread 0 is still executing bli_init_apis(), thus allowing thread 1 to
      use the library before it is actually ready. Thanks to Minh Quan Ho
      and Devin Matthews for pointing out this issue.
    - Part of the solution to the aforementioned race condition involved
      replacing the runtime initialization of the global scalar constants
      (e.g., BLIS_ONE, BLIS_ZERO, etc.) in bli_const.c with a static
      initialization of those same constants. This eliminates the need for
      bli_const_init() altogether. (The static initialization is made concise
      via preprocessor macros.)
    - Defined bli_gks_query_cntx_noinit(), which behaves just like
      bli_gks_query_cntx(), except that it does not call bli_init_once(). This
      function is called in lieu of bli_gks_query_cntx() in bli_ind_init() and
      bli_memsys_init() so as to not result in any recursion into
      bli_init_once().
    - Removed BLIS_ONE_HALF, BLIS_MINUS_ONE_HALF global scalar constants.
      They have no use in BLIS or its test products, and we have little reason
      to believe they are used by others.
    - Removed testsuite/out file, which was accidentally committed as part
      of 70640a3.
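
    Static initialization of scalar constants through a macro might look
    roughly like the following. The struct layout, the macro, and the
    DEMO_* constants are invented for this sketch and are much simpler
    than the objects defined in bli_const.c.

```c
#include <assert.h>
#include <stddef.h>

/* One value stored in each supported real precision. */
typedef struct
{
    float  s;
    double d;
} demo_const_t;

/* Expands to a brace initializer, so each constant is fully formed when
   the program is loaded and needs no runtime init function. */
#define DEMO_CONST_INIT(val) { (float)(val), (double)(val) }

const demo_const_t DEMO_ONE       = DEMO_CONST_INIT( 1.0);
const demo_const_t DEMO_ZERO      = DEMO_CONST_INIT( 0.0);
const demo_const_t DEMO_MINUS_ONE = DEMO_CONST_INIT(-1.0);

/* Read the double-precision field of a constant. */
double demo_const_as_double(const demo_const_t* c)
{
    assert(c != NULL);
    return c->d;
}
```

    Since the constants never pass through an uninitialized state, no
    thread can observe them before they are ready, whatever the state of
    the rest of the library.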

commit 6526d1d4ae6dbfa854ca8d1e5f224cd6ab3fa958
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 12 13:50:43 2017 -0600

    Added temp_dir argument to flatten-headers.sh.
    
    Details:
    - Added "temp_dir" argument to flatten-headers.sh so that the caller can
      specify where intermediate files should be created as the script runs.
    - Updated flatten-headers.sh to create intermediate files in temp_dir
      instead of alongside the corresponding source files. This should now
      (once again) allow out-of-tree builds where the BLIS distribution is
      read-only, or where the out-of-tree build is running concurrently with
      another out-of-tree build. (Thanks to Devin Matthews for pointing out
      the possibility of simultaneous out-of-tree builds.)

commit 94755017c967630daf2e31c1f63ed5e88ab0d6ab
Merge: d0c4dd00 5cf7b0c4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 12 12:50:41 2017 -0600

    Merge branch 'master' of github.com:flame/blis

commit d0c4dd000ff38acc249e8acf7e0655a523991695
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 12 12:47:53 2017 -0600

    Added out-of-tree build test to .travis.yml file.
    
    Details:
    - Modified .travis.yml file to include an out-of-tree build test (using
      the "auto" configure target). Thanks to Devin Matthews for this
      suggestion.

commit 5cf7b0c4e52922069183a87dc2aa177419644e04
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Dec 12 12:38:48 2017 -0600

    Ignore blis.h.interm [ci skip]

commit 8d8ff74d15b4a584929cec36034ba6d3c53f7d27
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 12 12:32:50 2017 -0600

    Further attempt to fix out-of-tree builds.
    
    Details:
    - Fix applied in 87978f6 was necessary but not sufficient to fix
      out-of-tree builds. It turns out that using a source tree that had
      already built the target erroneously gave the impression that
      out-of-tree builds were working again, when in fact they were still
      broken. The additional changes in this commit should complete the
      fix that was started in the aforementioned commit. Thanks to Devin
      Matthews and Shaden Smith for their help in isolating this issue.

commit 70640a37109290b57c344083c00624e13c496e30
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 11 17:18:43 2017 -0600

    Implemented library self-initialization.
    
    Details:
    - Defined two new functions in bli_init.c: bli_init_once() and
      bli_finalize_once(). Each is implemented with pthread_once(), which
      guarantees that, among the threads that pass in the same pthread_once_t
      data structure, exactly one thread will execute a user-defined function.
      (Thus, there is now a runtime dependency against libpthread even when
      multithreading is not enabled at configure-time.)
    - Added calls to bli_init_once() to top-level user APIs for all
      computational operations as well as many other functions in BLIS to
      all but guarantee that BLIS will self-initialize through the normal
      use of its functions.
    - Rewrote and simplified bli_init() and bli_finalize() and related
      functions.
    - Added -lpthread to LDFLAGS in common.mk.
    - Modified the bli_init_auto()/_finalize_auto() functions used by the
      BLAS compatibility layer to take and return no arguments. (The
      previous API that tracked whether BLIS was initialized, and then
      only finalized if it was initialized in the same function, was too
      cute by half and borderline useless because by default BLIS stays
      initialized when auto-initialized via the compatibility layer.)
    - Removed static variables that track initialization of the sub-APIs in
      bli_const.c, bli_error.c, bli_init.c, bli_memsys.c, bli_thread, and
      bli_ind.c. We don't need to track initialization at the sub-API level,
      especially now that BLIS can self-initialize.
    - Added a critical section around the changing of the error checking
      level in bli_error.c.
    - Deprecated bli_ind_oper_has_avail() as well as all functions
      bli_<opname>_ind_get_avail(), where <opname> is a level-3 operation
      name. These functions had no use cases within BLIS and likely none
      outside of BLIS.
    - Commented out calls to bli_init() and bli_finalize() in testsuite's
      main() function, and likewise for standalone test drivers in 'test'
      directory, so that self-initialization is exercised by default.
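
    The pthread_once() pattern described above can be sketched in a few
    lines. The demo_* names and the counter are hypothetical; only
    pthread_once() and PTHREAD_ONCE_INIT come from POSIX. The real
    bli_init_once() initializes several sub-APIs instead of bumping a
    counter.

```c
#include <assert.h>
#include <pthread.h>

static pthread_once_t demo_once       = PTHREAD_ONCE_INIT;
static int            demo_init_count = 0; /* times the init body ran */

/* The one-time initialization body. */
static void demo_init_apis(void)
{
    demo_init_count++;
}

/* Safe to call from any thread, any number of times. The init body runs
   exactly once, and every caller returns only after it has completed. */
void demo_init_once(void)
{
    int rc = pthread_once(&demo_once, demo_init_apis);
    assert(rc == 0);
    (void)rc; /* silence unused-variable warnings when NDEBUG is set */
}

/* A user-facing operation self-initializes before doing any work. */
int demo_operation(int x)
{
    demo_init_once();
    return x + demo_init_count;
}
```

    Repeated calls to demo_operation() trigger the init body only on the
    first call, so the caller never needs an explicit initialization step.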

commit 70a64432ee5a7adbee10fb7ff6d7b608c1940a7a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 11 13:14:20 2017 -0600

    Fixed off-by-one indexing in bli_cpuid.c.
    
    Details:
    - In bli_cpuid.c, fixed an off-by-one indexing statement in vpu_count()
      whereby a string-terminating NUL character, '\0', is written beyond
      the bounds of the model_num string.
    - Minor whitespace and formatting edits to bli_cpuid.c.
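
    The class of bug fixed above is easy to reproduce in miniature. The
    helper below is a hypothetical example, not the code from
    vpu_count(); it shows where the terminator may legally be written for
    a buffer of a given capacity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy at most cap-1 of the first n characters of src into dst and
   terminate the result. The terminator index is at most cap-1, which is
   the last valid index of the buffer. Returns the copied length. */
size_t demo_copy_terminated(char* dst, size_t cap,
                            const char* src, size_t n)
{
    assert(dst != NULL && src != NULL && cap > 0);

    size_t len = n < cap - 1 ? n : cap - 1;
    memcpy(dst, src, len);
    dst[len] = '\0'; /* writing dst[cap] instead would be one past the end */
    return len;
}
```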

commit 87978f6261a080d261d01f9acf4e9cc18855c833
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 11 12:49:03 2017 -0600

    Fixed broken out-of-tree builds since 52f9e6f.
    
    Details:
    - Added missing $(DIST_PATH)/ prefix to relative path to flatten-headers.sh
      script in common.mk so that the script could be found during out-of-tree
      builds. Thanks to Devin Matthews for reporting this bug.

commit 513ef4d040f89a18dda5154e8c4cf1aaf7463999
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 11 12:35:59 2017 -0600

    Various typecasting fixes, mis-typed enums, etc.
    
    Details:
    - Fixed implicit typecasting of conj_t to trans_t in bli_[un]packm_cxk.c.
    - Properly typecast integer arguments to match format specifier in various
      calls to printf() in bli_l3_thrinfo.c, bli_cntx.c, bli_pool.c, and
      bli_util_oapi.c.
    - Fixed "unsigned less-than-comparison with zero" checks in bli_check.c,
      bli_cntx.h.
    - Fixed mis-typed enums in bli_cntx.c (e.g., l1mkr_t that should have been
      l1fkr_t or l1vkr_t).
    - Fixed instances of opid_t value BLIS_GEMM that should have been l3ukr_t
      value BLIS_GEMM_UKR in bli_cntx_ref.c.
    - NOTE: These issues were identified via compiler warnings when building
      BLIS with clang on a rather old installation of OS X:
        $ clang --version
        Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn)
        Target: x86_64-apple-darwin15.2.0
        Thread model: posix
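    
    Two of the warning classes above can be illustrated with a short
    sketch. The function names format_count() and index_in_range() are
    hypothetical and do not appear in BLIS; they only show the general
    shape of the fixes (casting to the type the format specifier expects,
    and dropping a less-than-zero test on an unsigned value).

```c
#include <stddef.h>
#include <stdio.h>

/* Cast the argument to exactly the type that the conversion specifier
   expects, rather than relying on an implicit conversion. */
int format_count(char *buf, size_t buf_size, size_t count)
{
    return snprintf(buf, buf_size, "%lu", (unsigned long)count);
}

/* For an unsigned index, "idx < 0" is always false (and clang warns about
   it), so only the upper bound needs to be tested. */
int index_in_range(unsigned int idx, unsigned int n)
{
    return idx < n;
}
```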

commit 3bc99a96a3648f51b9acdc8a8c7e1cf4eb815459
Merge: 3a441183 78199c53
Author: prangana <pradeep.rao@amd.com>
Date:   Mon Dec 11 12:53:03 2017 +0530

    Fix merge conflicts after rebase with release branch
    
    Change-Id: I581b26c6d515f717ff0dce91c7c0c92553aa2630

commit 3a44118398955d6f872e01f73ae5bb4a4f8500f7
Author: Nisanth M P <nisanth.padinharepatt@amd.com>
Date:   Wed Nov 15 11:11:17 2017 +0530

    Added AMD copyright line to the changed files in last 3 commits
    
    Change-Id: I37d5dbbbe1b199e07529610a5e9cc9e49d067c66

commit 268a56c06e94d1c388766dbfe81d54efbe432809
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 1 11:51:41 2017 -0500

    Revert to default SIMD alignment for bulldozer.
    
    Details:
    - Removed the default-overriding #define of BLIS_SIMD_ALIGN_SIZE set in
      config/bulldozer/bli_kernel.h. Not sure where this value came from, but
      it would seem to allow for insufficient starting address alignment for
      any matrices created via bli_malloc_user(), such as via
      bli_obj_create(). Thanks to Rene Sitt for reporting the behavior that
      led us to this bug.
    - This commit is a manual patch of the same fix made to the 'rt' branch
      in 8f150f2.

commit 510a6863e28277f9446abfb77f1aea9f01d37e7a
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Oct 30 10:04:42 2017 -0500

    Fix CVECFLAGS for bulldozer config.

commit c669716790bdda5d2b11ea0a026cbc121b228842
Author: Nisanth M P <nisanth.padinharepatt@amd.com>
Date:   Tue Oct 24 16:36:36 2017 +0530

    Adding __attribute__((constructor/destructor)) for CLANG case.
    
    CLANG supports __attribute__, but its documentation doesn't
    mention support for constructor/destructor. Compiling with
    clang and testing shows that it does support this.
    
    Change-Id: Ie115b20634c26bda475cc09c20960d687fb7050b

commit 24e64a9d0877d788357fc63d4b947e977f8697f7
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 18 13:41:25 2017 -0500

    Removed a duplicate bli_avx512_macros.h header.
    
    Details:
    - Removed a duplicate header file that was causing problems during
      installation for the 'knl' configuration. Thanks to Victor Eijkhout
      for reporting this issue.

commit 9c0a3c4c0260cbfefb9f11532f46508b4fd19ec2
Author: Nisanth M P <nisanth.padinharepatt@amd.com>
Date:   Mon Oct 16 22:06:57 2017 +0530

    Thread Safety: Move bli_init() before and bli_finalize() after main()
    
    BLIS provides APIs to initialize and finalize its global context.
    One application thread can finalize BLIS, while other threads
    in the application are still using BLIS.
    
    This issue can be solved by removing bli_finalize() from the API.
    One way to do this is by getting bli_finalize() to execute by default
    after the application exits from main().
    
    GCC supports this behaviour with the help of __attribute__((destructor))
    added to the function that needs to be executed after main() exits.
    
    Similarly, bli_init() can be made to run before the application enters
    main() so that the application need not call it.
    
    Change-Id: I7ce6cfa28b384e92c0bdf772f3baea373fd9feac
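    
    A minimal sketch of the constructor/destructor mechanism described
    above, assuming GCC or clang. The names lib_auto_init(),
    lib_auto_finalize(), and lib_ready() are hypothetical stand-ins and
    are not BLIS symbols.

```c
/* Hypothetical library state; not a BLIS variable. */
static int lib_is_initialized = 0;

/* Runs before main() is entered (GCC/clang extension). */
__attribute__((constructor))
static void lib_auto_init(void)
{
    lib_is_initialized = 1;
}

/* Runs after main() returns or exit() is called. */
__attribute__((destructor))
static void lib_auto_finalize(void)
{
    lib_is_initialized = 0;
}

/* Query used by callers; reports 1 anywhere inside main(). */
int lib_ready(void)
{
    return lib_is_initialized;
}
```

    Because finalization only happens after main() exits, no application
    thread can finalize the library while another is still using it.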

commit 83f31253eb21c5ecd8a5907835e57720daae0b8b
Author: Nisanth M P <nisanth.padinharepatt@amd.com>
Date:   Mon Oct 16 21:07:50 2017 +0530

    Thread safety: Make the global induced method status array local to thread
    
    BLIS retains a global status array for induced methods, and provides
    APIs to modify this state during runtime. So, one application thread
    can modify the state, before another starts the corresponding
    BLIS operation.
    
    This patch solves this issue by making the induced method status array
    local to threads.
    
    Change-Id: Iff59b6f473771344054c010b4eda51b7aa4317fe
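    
    A minimal sketch of a per-thread status table, assuming a C11 compiler
    (GCC and clang also accept the older __thread spelling). The array
    dimensions and the names ind_sketch_set() and ind_sketch_get() are
    illustrative assumptions, not the identifiers used in BLIS.

```c
#include <stdbool.h>

/* Illustrative sizes only. */
enum { IND_SKETCH_NUM_METHODS = 3, IND_SKETCH_NUM_OPS = 4 };

/* Each thread gets its own zero-initialized copy of this table, so one
   thread changing a status cannot affect an operation that another
   thread is about to start. */
static _Thread_local bool
ind_sketch_status[IND_SKETCH_NUM_METHODS][IND_SKETCH_NUM_OPS];

void ind_sketch_set(int method, int op, bool status)
{
    ind_sketch_status[method][op] = status;
}

bool ind_sketch_get(int method, int op)
{
    return ind_sketch_status[method][op];
}
```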

commit e923402e68029be379a4297de3ac6fb155ffd928
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Thu Sep 28 12:15:36 2017 +0530

    The inner loop parallelization is turned off by default; the JR and IR loop parameters are set to 1 by default
    
    Change-Id: I8c3c2ecbbd636259f6ffb92768ec04148205c3e5

commit a64c15de19327c7595376d699be676c7003e850e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 26 19:02:53 2017 -0500

    Fixed a pthread typo in previous commit.
    
    Details:
    - Misnamed 'pthread_mutex_t' type in bli_memsys.c as 'thread_mutex_t'.

commit 42dcd589c37e1a2473ab2e1539207da97aebc07f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 26 17:00:04 2017 -0500

    Fixed bugs in gemm/gemmtrsm ukr tests in testsuite.
    
    Details:
    - Fixed a bug in gemmtrsm test module that was due to improper partitioning
      into a k x k triangular matrix for the purposes of obtaining an mr x k
      micropanel of A with which to test.
    - Fixed a bug in gemm and gemmtrsm test modules that would only manifest for
      very large k (depending on the product of mr x kc on that architecture).
      The bug arose from the fact that the test module was triggering the
      allocation of blocks from the internal memory pools, which are limited in
      size. This allocation imposes an implicit assumption that the micro-
      panel being tested with will fit inside, and this assumption is violated
      for large values of k. Arbitrarily large k may now be tested for both
      operation tests.
    - Added OpenMP/pthread critical sections around the setting or getting of
      statuses from the induced method operation lookup table in bli_l3_ind.c.
    - Added the 'static' keyword to all pthread_mutex_t global variables in BLIS.
    - Thanks to Nisanth Padinharepatt of AMD for reporting the first and third
      issues.
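    
    A minimal sketch of the pthreads flavor of such a critical section,
    assuming a small lookup table guarded by a file-scope mutex. The names
    l3_sketch_status_set() and l3_sketch_status_get() and the table size
    are hypothetical; they are not the code in bli_l3_ind.c.

```c
#include <pthread.h>
#include <stdbool.h>

enum { L3_SKETCH_NUM_OPS = 4 };   /* illustrative size only */

static bool l3_sketch_status[L3_SKETCH_NUM_OPS];

/* 'static' keeps the mutex private to this translation unit, so it
   cannot collide with a same-named symbol elsewhere in the program. */
static pthread_mutex_t l3_sketch_mutex = PTHREAD_MUTEX_INITIALIZER;

void l3_sketch_status_set(int op, bool status)
{
    pthread_mutex_lock(&l3_sketch_mutex);
    l3_sketch_status[op] = status;
    pthread_mutex_unlock(&l3_sketch_mutex);
}

bool l3_sketch_status_get(int op)
{
    pthread_mutex_lock(&l3_sketch_mutex);
    bool status = l3_sketch_status[op];
    pthread_mutex_unlock(&l3_sketch_mutex);
    return status;
}
```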

commit 206beb68ff73b75f5c382413967aacbb8a0aac3a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Sep 9 14:10:15 2017 -0500

    Updated bibtex info for BLIS5 (3m4m) article.

commit 0c8c0363aeb1f4aa88f7ec2d02403dab05a6e014
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Mon Aug 28 16:44:42 2017 +0530

    Bug fix for the testsuite build failing
    
    Change-Id: I7cd8c9d187387c48b2564e45cbfb8df985e93d77

commit 63d1c84465b50f64787808dd3e8494e683c16821
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Wed Aug 23 13:01:14 2017 +0530

    Adding auto hardware detection for Zen
    
    Change-Id: I40ce6705dd66b35000c4ccddffad1c5b65998caf

commit 537fb2a895b09be94b11947696fd2da629be24dd
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Aug 15 10:02:25 2017 -0500

    Add vzeroupper to Intel AVX kernels.

commit 7628de3f76f78a44788807605a4601ddda445854
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 10 16:24:28 2017 -0500

    Removed trailing enum commas from bli_type_defs.h.
    
    Details:
    - Removed trailing commas from enums in bli_type_defs.h. Thanks to
      Erling Andersen for pointing out this inconsistency and suggesting
      the change.

commit a666fd4e267ffae3d4b21f38d569c61ff56adc9e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Aug 5 13:04:31 2017 -0500

    Added edge handling to _determine_blocksize_b().
    
    Details:
    - Added explicit handling of situations where i == dim to
      bli_determine_blocksize_b_sub(). This isn't actually needed by any
      current use case within BLIS, but handling the situation is nonetheless
      prudent. Thanks to Minh Quan for reporting this issue and requesting
      the fix.

commit 0c8afa546d7f33760415519ba328d7c49eb7aa06
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Aug 4 14:17:44 2017 -0500

    Fixed a minor bug in level-3 packm management.
    
    Details:
    - Fixed a bug in bli_l3_packm() that caused cntl_t-cached packed mem_t
      entries to be released and then re-acquired unnecessarily. (In essence,
      the "<" operands in the conditional that guards the
      release-and-reacquire code block simply needed to be swapped.) The bug
      should have only affected performance (rather than the computed result).
      Thanks to Minh Quan for identifying and reporting the bug.

commit 6cf68a185d83fa46d438fcef65258ace78e24b13
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Jul 31 15:19:51 2017 -0500

    Change lsame_ signature to match lapacke.

commit 6a9bd97295cc4fb1cbcd28f69824a43c073c9a76
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jul 29 20:17:05 2017 -0500

    Fixed pthreads compile bug with previous commit.
    
    Details:
    - Erroneously passed family parameter into l3int_t function despite
      that function not taking the parameter. Oops.

commit 95adc43d800431dc0a02ca83a51426dbef641ad6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jul 29 14:53:39 2017 -0500

    Moved 'family' field from cntx_t to cntl_t.
    
    Details:
    - Removed the family field inside the cntx_t struct and re-added it to the
      cntl_t struct. Updated all accessor functions/macros accordingly, as well
      as all consumers and intermediaries of the family parameter (such as
      bli_l3_thread_decorator(), bli_l3_direct(), and bli_l3_prune_*()). This
      change was motivated by the desire to keep the context limited, as much
      as possible, to information about the computing environment. (The family
      field, by contrast, is a descriptor about the operation being executed.)
    - Added additional functions to bli_blksz_*() API.
    - Added additional functions to bli_cntx_*() API.
    - Minor updates to bli_func.c, bli_mbool.c.
    - Removed 'obj' from bli_blksz_*() API names.
    - Removed 'obj' from bli_cntx_*() API names.
    - Removed 'obj' from bli_cntl_*(), bli_*_cntl_*() API names. Renamed routines
      that operate only on a single struct to contain the "_node" suffix to
      differentiate with those routines that operate on the entire tree.
    - Added enums for packm and unpackm kernels to bli_type_defs.h.
    - Removed BLIS_1F and BLIS_VF from bszid_t definition in bli_type_defs.h.
      They weren't being used and probably never will be.

commit a98e4aa547f61ab09dd91d11478c2a2ef9882e11
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu Jul 20 14:50:13 2017 -0500

    Clang can't make up its mind what to support.

commit 32eb36c3e8c2add2528514272044de16faed0c8f
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu Jul 20 12:54:58 2017 -0500

    Add default #define for __has_extension.

commit 2a9aa134f7c29d3d4fdc160022ff257e61885a95
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu Jul 20 10:04:34 2017 -0500

    Add fallbacks to __sync_* or __c11_atomic_* builtins when __atomic_* is not supported. Fixes #143.

commit 6f07a034d575e1e9e30bb6417b8fcb77cf301297
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 19 15:40:48 2017 -0500

    Updated ar option list used by all configurations.
    
    Details:
    - Dropped 'u' from the list of modifiers passed into the library archiver
      ar. Previously, "cru" was used, while now we employ only "cr". This
      change was prompted by a warning observed on Ubuntu 16.04:
    
        ar: `u' modifier ignored since `D' is the default (see `U')
    
      This caused me to realize that the default mode causes timestamps to be
      zero, and thus the 'u' option, which causes only changed object files to
      be inserted, is not applicable.

commit 32bc03f9eed8795cfd2f2615d1c9f8673e039c57
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 19 13:51:53 2017 -0500

    Added --force-version=STRING option to configure.
    
    Details:
    - Added an option to configure that allows the user to force an arbitrary
      version string at configure-time. The help text also now describes the
      usage information.
    - Changed the way the version string is communicated to the Makefile.
      Previously, it was read into the VERSION variable from the 'version' file
      via $(shell cat ...). Now, the VERSION variable is instead set in
      config.mk (via a configure-substituted anchor from config.mk.in).

commit befaee6dd8b2a72de9e0461fe2ec1f36e9f88f3c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 18 17:56:00 2017 -0500

    Updated openmp/pthread barriers with GNU atomics.
    
    Details:
    - Updated the non-tree openmp and pthreads barriers defined in
      bli_thrcomm_openmp.c and bli_thrcomm_pthreads.c to instead call a common
      implementation in bli_thrcomm.c, bli_thrcomm_barrier_atomic(). This new
      implementation goes through the same motions as the previous codes, but
      protects its loads and increments with GNU atomic built-ins. These atomic
      statements take memory ordering parameters that allow us to specify just
      enough constraints for the barrier to work as intended on weakly-ordered
      hardware. The prior implementation was only guaranteed to work on systems
      with strongly-ordered memory. (Thanks to Devin Matthews for suggesting
      this change and his crash-course in atomics and memory ordering.)
    - Removed 'volatile' from structs' barrier field declarations in
      bli_thrcomm_*.h.
    - Updated bli_thrcomm_pthread.? files to use renamed struct barrier fields
      consistent with that of the _openmp.? files.
    - Updated other bli_thrcomm_* files to rename "communicator" variables to
      simply "comm".
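    
    A minimal sketch of a sense-reversing barrier built on the GNU
    __atomic built-ins, in the spirit of the change described above. The
    struct, field, and function names are assumptions made for this
    sketch, and the memory orderings shown are one plausible choice rather
    than a statement of what bli_thrcomm_barrier_atomic() uses.

```c
typedef struct
{
    int n_threads;   /* number of participating threads           */
    int arrived;     /* threads that have reached the barrier     */
    int sense;       /* flips each time the barrier completes     */
} barrier_sketch_t;

void barrier_sketch_init(barrier_sketch_t *b, int n_threads)
{
    b->n_threads = n_threads;
    b->arrived   = 0;
    b->sense     = 0;
}

void barrier_sketch_wait(barrier_sketch_t *b)
{
    /* Remember the sense that was in effect when this thread arrived. */
    int my_sense = __atomic_load_n(&b->sense, __ATOMIC_RELAXED);

    /* Atomically announce arrival and learn how many have arrived. */
    int n_arrived = __atomic_add_fetch(&b->arrived, 1, __ATOMIC_ACQ_REL);

    if (n_arrived == b->n_threads)
    {
        /* Last thread in: reset the counter, then flip the sense with
           release ordering so that the reset is visible to the waiters. */
        __atomic_store_n(&b->arrived, 0, __ATOMIC_RELAXED);
        __atomic_fetch_xor(&b->sense, 1, __ATOMIC_RELEASE);
    }
    else
    {
        /* Spin until the last thread flips the sense. */
        while (__atomic_load_n(&b->sense, __ATOMIC_ACQUIRE) == my_sense)
            ;
    }
}
```

    The acquire/release pair on the sense flag is what allows the barrier
    to work on weakly-ordered hardware without declaring the fields
    'volatile'.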

commit 8f739cc847fcff2ddeeb336f8b2b9d080eb16f6c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 17 19:03:22 2017 -0500

    Added API to set mt environment variables.
    
    Details:
    - Renamed bli_env_get_nway() -> bli_thread_get_env().
    - Added bli_thread_set_env() to allow setting environment variables
      pertaining to multithreading, such as BLIS_JC_NT or BLIS_NUM_THREADS.
    - Added the following convenience wrapper routines:
        bli_thread_get_jc_nt()
        bli_thread_get_ic_nt()
        bli_thread_get_jr_nt()
        bli_thread_get_ir_nt()
        bli_thread_get_num_threads()
        bli_thread_set_jc_nt()
        bli_thread_set_ic_nt()
        bli_thread_set_jr_nt()
        bli_thread_set_ir_nt()
        bli_thread_set_num_threads()
    - Added #include "errno.h" to bli_system.h.
    - This commit addresses issue #140.
    - Thanks to Chris Goodyer for inspiring these updates.

commit 10163833075fd42be5b5b503acc855f91a484cfd
Author: Marat Dukhan <marat@fb.com>
Date:   Thu Jul 13 21:39:24 2017 -0700

    Fix Emscripten builds

commit c09b30d115eade72f44f37bf90aa848c9c0e79af
Author: Minh Quan HO <mqho@kalray.eu>
Date:   Fri Jul 7 10:52:05 2017 +0200

    set missing free_fp in bli_membrk_init for free-ing GEN_USE buffers
    
    The membrk's free_fp is called when releasing GEN_USE buffers, but this free_fp is
    not set in bli_membrk_init
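    
    A minimal sketch of why the missing assignment matters, assuming a
    broker struct that stores its allocator as function pointers. The type
    broker_sketch_t and its functions are hypothetical and do not
    reproduce the membrk_t API.

```c
#include <stdlib.h>

typedef void *(*malloc_fn_t)(size_t);
typedef void  (*free_fn_t)(void *);

typedef struct
{
    malloc_fn_t malloc_fp;
    free_fn_t   free_fp;
} broker_sketch_t;

void broker_sketch_init(broker_sketch_t *b)
{
    b->malloc_fp = malloc;
    /* If this assignment is omitted, the release function below calls
       through an uninitialized pointer. */
    b->free_fp   = free;
}

void *broker_sketch_acquire(broker_sketch_t *b, size_t size)
{
    return b->malloc_fp(size);
}

void broker_sketch_release(broker_sketch_t *b, void *p)
{
    b->free_fp(p);
}
```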

commit 997628ed9793c72e9ef576dd8d715cfec27c4862
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Fri Jun 30 12:23:19 2017 +0530

    Reducing the framework overhead of GEMV routines
    
    Change-Id: I83607ad767bff74e305e915b54b0ea34ec3e5684

commit ee869066168239b710ad9938bb0e1ae454883f3a
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Tue Jul 4 12:57:32 2017 +0530

    Improved efficiency of dGEMM for large matrices by reducing TLB load misses and, chiefly, L3 cache misses. This is achieved by changing the packed block sizes of matrices A and B. The optimum values are now MC_D = 510 and KC_D = 1024.
    
    Change-Id: I2d8bdd5f62f2d1f8782ae2997f3d7a26587d1ca4

commit 7b933b90b1859c96de49a402d48de82909bc73e5
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Jun 6 20:23:17 2017 -0500

    Add new SSI acknowledgment

commit 3485abba4b426fbf42b146a9611a0841f6d236c6
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Wed May 24 11:48:16 2017 +0530

    Checked in the small matrix code that computes GEMM for the A-transpose case
    
    Change-Id: I29f40046d43d7a4b037c1cb322503ee26495f462

commit de16beb83b29b4b9748f70db985b0fe04db85f7d
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri May 26 14:49:31 2017 -0400

    PACKDIM_MR=8 didn't work out, but messing with the prefetching helps 2%.

commit 25d0e618544b6eea7d3f13c7aec513ac0139801d
Author: Devin Matthews <dmatthews@gator3.ufhpc>
Date:   Fri May 26 14:47:36 2017 -0400

    Revert "Change PACKDIM_MR (double) for haswell to 8."
    
    This reverts commit 681eec913d7c2ebcff637cec5c1627ced9a92b99.

commit c5bdd84b35bc2a8ebf55b7763fb56c0c945be0cb
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri May 26 12:28:09 2017 -0500

    Change PACKDIM_MR (double) for haswell to 8.

commit 172789d562001293b973bbdd8015bd27d37292e8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 17 13:03:52 2017 -0500

    Restored deleted lines from makefile fragments.

commit 3ea9bd2c8e90dbd35655fa6a5b953dfea1f308fe
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed May 17 12:29:44 2017 -0500

    Change to /bin/sh.
    
    All scripts checked with Debian's checkbashisms. Also check for clang first in auto-detect.sh.

commit 49438409eedb98d3f0ebf00b8d1eee0ae45f4f8c
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed May 17 12:27:14 2017 -0500

    Remove shebangs from makefiles.

commit 497e2640474c016d576dce3530fa6a66891642a0
Author: J M Dieterich <dieterich@ogolem.org>
Date:   Tue May 16 23:11:22 2017 -0400

    Fix if/else structure. Thanks to TravisCI.

commit 835035c56a8de36ad25bb8d1375db170d489ef57
Author: J M Dieterich <dieterich@ogolem.org>
Date:   Tue May 16 22:23:27 2017 -0400

    Mark piledriver compilable w/ clang.

commit 6cdb533472ee61af297c1f948307abbf45828887
Author: J M Dieterich <dieterich@ogolem.org>
Date:   Tue May 16 22:12:12 2017 -0400

    Mark bulldozer compilable w/ clang.

commit a85697d62272da06d28cd1c947f6cf1098df6467
Author: J M Dieterich <dieterich@ogolem.org>
Date:   Tue May 16 22:06:59 2017 -0400

    Correct error message.

commit e0c64cad271058688a2b999caf8c2767dc3aef7e
Author: J M Dieterich <dieterich@ogolem.org>
Date:   Tue May 16 22:03:23 2017 -0400

    Indeed one can compile for carrizo also using clang.

commit 4aafe0505d3f0954d095ded5459a76976e5093b4
Author: J M Dieterich <dieterich@ogolem.org>
Date:   Tue May 16 21:50:49 2017 -0400

    A bunch of shebang fixes from unportable /bin/bash to portable /usr/bin/env bash

commit abaeaa68ea11e84be1810f564d6f38d506cbeb6a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri May 5 15:06:56 2017 -0500

    Fixed a bug in norm1v, norm1m.
    
    Details:
    - Fixed a bug that manifested as improperly-computed 1-norm for vectors
      and matrices. This is one of the few operations in BLIS that does not
      have its own test module within the testsuite, hence why it went
      undetected for so long. The bad 1-norms were being used to normalize
      matrices in the testsuite after initialization, which led to some
      matrices containing a combination of "large" and "small" values. This
      tended to push the residuals computed after each test away from zero.
      In some cases, they were off *just* enough for the testsuite to label
      it a "failure". Many thanks to Jeff Hammond for reporting this bug.
      (Wonky details: the bug was due to improperly-defined level-0 scalar
      macros for abval2, an operation that computes the absolute square,
      or complex magnitude/modulus. Certain complex domain instances of
      abval2 were being incorrectly defined in terms of real-only solutions,
      leading to bad results. This level-0 operation forms the basis of
      norm1v/norm1m. absq2 was also affected, but almost nothing uses
      this operation.)

commit cc3107ae1c2074f72b724aa748d2e5b4cb290ed5
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu May 4 10:35:22 2017 -0500

    Setting any one of BLIS_NT_[IJ][CR] overrides BLIS_NUM_THREADS. Missing BLIS_NT_XX's are defaulted to 1. Fixes #123.
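    
    A minimal sketch of the precedence rule stated above, assuming that a
    value of zero or less stands for "not set in the environment". The
    type ways_sketch_t and the function resolve_ways() are hypothetical
    and are not part of BLIS.

```c
#include <stdbool.h>

typedef struct
{
    int jc, ic, jr, ir;   /* per-loop ways of parallelism */
} ways_sketch_t;

/* Returns true if the per-loop values take effect (in which case the
   global thread count is ignored), or false if none were set and the
   caller should fall back to the global thread count. */
bool resolve_ways(ways_sketch_t requested, ways_sketch_t *out)
{
    bool any_set = requested.jc > 0 || requested.ic > 0 ||
                   requested.jr > 0 || requested.ir > 0;

    if (!any_set) return false;

    /* Any per-loop value left unset defaults to 1. */
    out->jc = requested.jc > 0 ? requested.jc : 1;
    out->ic = requested.ic > 0 ? requested.ic : 1;
    out->jr = requested.jr > 0 ? requested.jr : 1;
    out->ir = requested.ir > 0 ? requested.ir : 1;

    return true;
}
```

    For example, setting only the IC value to 2 yields 1 x 2 x 1 x 1 ways,
    regardless of the global thread count.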

commit c8ab91f70d399ee14edd30a3a5c46b24c5d2f910
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 3 15:04:51 2017 -0500

    Disable complex 3m/4m in testsuite by default.
    
    Details:
    - Disabled testsuite tests of all level-3 implementations based on 3m
      and 4m. This will improve testing runtime on Travis CI as well as for
      anyone manually running the testsuite using default test parameters.
      Thanks to Devin Matthews for suggesting this change.

commit 9700f0e5785007ddafb72a5ca83800dee61fd35c
Author: Jeff Hammond <jeff.science@gmail.com>
Date:   Tue May 2 19:25:21 2017 -0700

    allow KNL build without hbwmalloc.h (i.e. emulated)
    
    we want to be able to run BLIS KNL binaries on non-KNL machines via SDE.
    although it is possible to install hbwmalloc implementation on such
    systems, it is easier not to, since obviously the performance of SDE
    execution is not representative so there is no reason to emulate HBW
    allocation.

commit 17dcd5a33ff91967f67e7c0ba09b4f18754609a4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 2 16:48:43 2017 -0500

    Fixed stray parentheses in README citations.

commit 2910d44ff9e1d951d3249313f4ab39d18ea1b48d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 2 16:38:43 2017 -0500

    CHANGELOG update (0.2.2)

commit 5ca3863220e07972fcefc6682ddd3f6e54fe4a94
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 2 15:48:30 2017 -0500

    Fixed a trsm1m bug that affected right-side cases.
    
    Details:
    - Fixed a bug introduced in 1c732d3 that affected trsm1m_r. The result
      was nondeterministic behavior (usually segmentation faults) for certain
      problem sizes beyond the 1m instance of kc (e.g. 128 on haswell). The
      cause of the bug was my commenting out lines in bli_gemm1m_ukr_ref.c
      which explicitly directed the virtual gemm micro-kernel to use temporary
      space if the storage preference of the [real domain] gemm ukernel did
      not match the storage of the output matrix C. In the context of gemm,
      this handling is not needed because agreement between the storage pref
      and the matrix is guaranteed by a high-level optimization in BLIS.
      However, this optimization is not applied to trsm because the storage
      of C is not necessarily the same as the storage of the micro-panels of
      B--both of which are updated by the micro-kernel during a trsm
      operation. Thus, the guarantee of storage/preference agreement is not
      in place for trsm, which means we must handle that case within the
      virtual gemm micro-kernel.
    - Comment updates and a minor macro change to bli_trsm*_cntx_init() for
      3m1, 4m1a, and 1m.

commit 1af0b09f5c275ee7bac896cc6f36f42af721d9b5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 2 12:09:39 2017 -0500

    README.md update.
    
    Details:
    - Updated bibtex entries for 4th BLIS paper, and added entries for 5th
      and 6th BLIS papers.

commit db4a0bb8ba7cd697d68be8e5632371ee3e59fd63
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 17 12:07:27 2017 -0500

    Whitespace reformatting to armv8a kernels file.
    
    Details:
    - Updated formatting of function signature/header in
      kernels/armv8a/3/bli_gemm_opt_4x4.c.

commit e3eb01f6b990e205b15edcbaffd3d54b3ddd1ca4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 21 15:33:39 2017 -0600

    Disabled experiment-related 1m code.
    
    Details:
    - Commented out code in frame/ind/oapi/bli_l3_3m4m1m_oapi.c that was
      specifically inserted to facilitate the benchmarking of 1m block-panel
      and panel-block algorithms.
    - Updates to test/3m4m/Makefile, runme.sh script, and test_gemm.c to
      reflect changes used/needed during benchmarking.

commit 4f61528d56eed6a139eeac9db0c44e56f2d2d136
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jan 25 16:25:46 2017 -0600

    Added 1m-specific APIs for bp, pb gemm algorithms.
    
    Details:
    - Defined bli_gemmbp_cntl_create(), bli_gemmpb_cntl_create(), with the
      body of bli_gemm_cntl_create() replaced with a call to the former.
    - Defined bli_cntl_free_w_thrinfo(), bli_cntl_free_wo_thrinfo(). Now,
      bli_cntl_free() can check if the thread parameter is NULL, and if so,
      call the latter, and otherwise call the former.
    - Defined bli_gemm1mbp_cntx_init(), bli_gemm1mpb_cntx_init(), both in
      terms of bli_gemm1mxx_cntx_init(), which behaves the same as
      bli_gemm1m_cntx_init() did before, except that an extra bool parameter
      (is_pb) is used to support both bp and pb algorithms (including to
      support the anti-preference field described below).
    - Added support for "anti-preference" in context. The anti_pref field,
      when true, will toggle the boolean return value of routines such as
      bli_cntx_l3_ukr_eff_prefers_storage_of(), which has the net effect of
      causing BLIS to transpose the operation to achieve disagreement (rather
      than agreement) between the storage of C and the micro-kernel output
      preference. This disagreement is needed for panel-block implementations,
      since they induce a transposition of the suboperation immediately before
      the macro-kernel is called, which changes the apparent storage of C. For
      now, anti-preference is used only with the pb algorithm for 1m (and not
      with any other non-1m implementation).
    - Defined new functions,
        bli_cntx_l3_ukr_eff_prefers_storage_of()
        bli_cntx_l3_ukr_eff_dislikes_storage_of()
        bli_cntx_l3_nat_ukr_eff_prefers_storage_of()
        bli_cntx_l3_nat_ukr_eff_dislikes_storage_of()
      which are identical to their non-"eff" (effectively) counterparts except
      that they take the anti-preference field of the context into account.
    - Explicitly initialize the anti-pref field to FALSE in
      bli_gks_cntx_set_l3_nat_ukr_prefs().
    - Added bli_gemm_ker_var1.c, which implements a panel-block macro-kernel
      in terms of the existing block-panel macro-kernel _ker_var2(). This
      technique requires inducing transposes on all operands and swapping
      A and B.
    - Changed bli_obj_induce_trans() macro so that pack-related fields are
      also changed to reflect the induced transposition.
    - Added a temporary hack to bli_l3_3m4m1m_oapi.c that allows us to easily
      specify the 1m algorithm (block-panel or panel-block).
    - Renamed the following cntx_t-related macros:
        bli_cntx_get_pack_schema_a() -> bli_cntx_get_pack_schema_a_block()
        bli_cntx_get_pack_schema_b() -> bli_cntx_get_pack_schema_b_panel()
        bli_cntx_get_pack_schema_c() -> bli_cntx_get_pack_schema_c_panel()
      and updated all instantiations. Also updated the field names in the
      cntx_t struct.
    - Comment updates.
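    
    A minimal sketch of how an anti-preference flag can toggle the result
    of a storage preference query, as described above. The struct
    pref_cntx_sketch_t and both functions are simplified, hypothetical
    stand-ins; they do not reproduce the cntx_t layout or the signatures
    of the bli_cntx_l3_ukr_*() functions.

```c
#include <stdbool.h>

typedef struct
{
    bool ukr_prefers_rows;   /* micro-kernel output preference      */
    bool anti_pref;          /* when true, invert effective queries */
} pref_cntx_sketch_t;

/* Plain query: does the storage of C agree with the preference? */
bool ukr_sketch_prefers_storage_of(const pref_cntx_sketch_t *cntx,
                                   bool c_is_row_stored)
{
    return cntx->ukr_prefers_rows == c_is_row_stored;
}

/* "Effective" query: same answer, toggled when anti_pref is set. */
bool ukr_sketch_eff_prefers_storage_of(const pref_cntx_sketch_t *cntx,
                                       bool c_is_row_stored)
{
    bool r = ukr_sketch_prefers_storage_of(cntx, c_is_row_stored);
    return cntx->anti_pref ? !r : r;
}
```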

commit 1d728ccb2394e77365e7c42683db6579c5fba014
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Nov 25 18:29:49 2016 -0600

    Implemented the 1m method.
    
    Details:
    - Implemented the 1m method for inducing complex domain matrix
      multiplication. 1m support has been added to all level-3 operations,
      including trsm, and is now the default induced method when native
      complex domain gemm microkernels are omitted from the configuration.
    - Updated _cntx_init() operations to take a datatype parameter. This was
      needed for the corresponding function for 1m (because 1m requires us
      to choose between column-oriented or row-oriented execution, which
      requires us to query the context for the storage preference of the
      gemm microkernel, which requires knowing the datatype) but I decided
      that it made sense for consistency to add the parameter to all other
      cntx initialization functions as well, even though those functions
      don't use the parameter.
    - Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take
      a second scalar for each blocksize entry. The semantic meaning of the
      two scalars now is that the first will scale the default blocksize
      while the second will scale the maximum blocksize. This allows scaling
      the two independently, and was needed to support 1m, which requires
      scaling for a register blocksize but not the register storage
      blocksize (ie: "packdim") analogue.
    - Deprecated bli_blksz_reduce_dt_to() and defined two new functions,
      bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing
      default and maximum blocksizes to some desired blocksize multiple.
      These functions are needed in the updated definitions of
      bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs().
    - Added support for the 1e and 1r packing schemas to packm, including
      1e/1r packing kernels.
    - Added a minor optimization to bli_gemm_ker_var2() that allows, under
      certain circumstances (specifically, real domain beta and row- or
      column-stored matrix C), the real domain macrokernel and microkernel
      to be called directly, rather than using the virtual microkernel
      via the complex domain macrokernel, which carries a slight additional
      amount of overhead.
    - Added 1m support to the testsuite.
    - Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified
      some code in test_gemm.c driver.

commit 0d1b90286e29aa8b768e280b5286d92c02ad87a1
Author: Jeff Hammond <jeff.science@gmail.com>
Date:   Tue Oct 25 21:15:26 2016 -0700

    never use libm with Intel compilers
    
    Intel compilers include a highly optimized math library (libimf) that
    should be used instead of GNU libm.
    
    yes, this change is for ALL targets, including those that are not
    supported by the Intel compiler.  there is no harm in doing this, and it
    is future-proof in the event that the Intel compilers support other
    architectures.

commit b150870397e7aee558e61d1bd72a0c0d1d99bee8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Dec 8 16:08:41 2017 -0600

    Removed most "old" directories.
    
    Details:
    - Removed the vast majority of directories named "old", which contained
      deprecated code that I wasn't quite ready to jettison from the source
      tree.

commit 270c65985df849297ba1951aa3b56c03948d7775
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Dec 8 15:21:18 2017 -0600

    Modified bli_getopt() for thread-safety.
    
    Details:
    - Changed the interface of bli_getopt() to take a new argument, a getopt_t
      struct, that stores the values of optarg, optind, opterr, and optopt,
      and updated the implementation accordingly. (Previously, these
      variables were assumed to be global.)
    - Added a function for initializing a getopt_t struct.
    - Changed test_libblis.c--currently the only consumer of bli_getopt()--to
      utilize the new getopt_t state object.
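    
    A minimal sketch of a getopt-style parser that keeps its state in a
    caller-supplied struct instead of in globals. The names opt_state_t,
    opt_state_init(), and opt_next() are hypothetical and this is not the
    bli_getopt() implementation; it handles only the "-x" and "-x value"
    forms.

```c
#include <stddef.h>
#include <string.h>

typedef struct
{
    const char *optarg;   /* argument of the last option, or NULL */
    int         optind;   /* index of the next argv element       */
    int         optopt;   /* offending option character on error  */
} opt_state_t;

void opt_state_init(opt_state_t *s)
{
    s->optarg = NULL;
    s->optind = 1;
    s->optopt = 0;
}

/* Returns the option character, '?' for an unknown option or a missing
   argument, or -1 when there are no more single-letter options. */
int opt_next(int argc, char *const argv[], const char *optstring,
             opt_state_t *s)
{
    s->optarg = NULL;
    if (s->optind >= argc) return -1;

    const char *tok = argv[s->optind];
    if (tok[0] != '-' || tok[1] == '\0' || tok[2] != '\0') return -1;

    int c = (unsigned char)tok[1];
    const char *spec = (c != ':') ? strchr(optstring, c) : NULL;
    s->optind++;

    if (spec == NULL) { s->optopt = c; return '?'; }

    if (spec[1] == ':')   /* this option takes an argument */
    {
        if (s->optind >= argc) { s->optopt = c; return '?'; }
        s->optarg = argv[s->optind++];
    }
    return c;
}
```

    Since every call reads and writes only the state object it is given,
    two threads can parse separate argument lists at the same time.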

commit ce4d8fabc2e39371f89c12192fb707be82ae021a
Merge: 39be59f2 e05a8dfa
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 7 17:36:44 2017 -0600

    Merge branch 'master' of github.com:flame/blis

commit 39be59f2a8470f40475907d9dd52639b8a911a92
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 7 17:35:20 2017 -0600

    Replaced several macros with static function APIs.
    
    Details:
    - Reimplemented several sets of get/set-style preprocessor macros with
      static functions, including those in the following frame/base headers:
      auxinfo, cntl, mbool, mem, membrk, opid, and pool. A few headers in
      frame/thread were touched as well: mutex_*, thrcomm, and thrinfo.
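    
    A minimal sketch of the macro-to-static-function conversion, assuming
    a small struct with one accessor pair. The type mem_sketch_t and its
    accessors are hypothetical, and the use of 'inline' here is an
    assumption of this sketch rather than something stated in the commit.

```c
#include <stddef.h>

typedef struct
{
    void  *buf;
    size_t size;
} mem_sketch_t;

/* Macro style: no type checking, and the argument is pasted textually.
   #define mem_sketch_size( mem_p )  ( (mem_p)->size )                  */

/* Static function style: the argument is type-checked and evaluated
   exactly once. */
static inline size_t mem_sketch_size(const mem_sketch_t *mem)
{
    return mem->size;
}

static inline void mem_sketch_set_size(mem_sketch_t *mem, size_t size)
{
    mem->size = size;
}
```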

commit e05a8dfa7cc7df41e966c1ad04e51c482b308b23
Merge: 79507337 4423e33d
Author: dnp <devangiparikh@gmail.com>
Date:   Wed Dec 6 16:45:24 2017 -0600

    Merge branch 'rt'

commit 4423e33dc593115cda92c5763d756d7ad1298aa9
Author: dnp <devangiparikh@gmail.com>
Date:   Wed Dec 6 16:35:03 2017 -0600

    Adding SKX kernels and configuration.

commit 79507337e140daec7639f6eb3ed9cfe6e123d342
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Dec 6 16:21:35 2017 -0600

    Various checks to ensure that arch_t id is in range.
    
    Details:
    - Expanded checking of the arch_t id in bli_gks.c--either passed in from
      the caller or as returned from bli_arch_query_id()--against the expected
      range of id values. Thanks to Devangi Parikh for suggesting these
      additional sanity checks.

commit fde7c1126c58373ecde83471890b257399144876
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 4 16:11:01 2017 -0600

    Added 'uninstall-old-headers' target to Makefile.
    
    Details:
    - Defined a new 'uninstall-old-headers' target that allows users of BLIS to
      uninstall no-longer-needed headers left over from previous installations.
    - Fixed the 'uninstall-old' target so that it will uninstall both .a and .so
      libraries.
    - Renamed 'uninstall-old' to 'uninstall-old-libs'.
    - Added 'uninstall-old' target (different from previous 'uninstall-old'
      target) that combines 'uninstall-old-libs' and 'uninstall-old-headers'.

commit d4ee770bde213a87aa6049245145318324dc6b51
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 4 14:53:43 2017 -0600

    Create/install monolithic cblas.h.
    
    Details:
    - When CBLAS is enabled at configure-time, BLIS now creates a monolithic
      cblas.h using the same flatten-header.sh script that was recently
      introduced for creating monolithic blis.h header files. The top-level
      Makefile will also install this cblas.h file into the install prefix
      alongside blis.h when the 'install' target is invoked. The two header
      files are compatible with one another. Regardless of whether the user's
      source #includes cblas.h, both blis.h and cblas.h, or just blis.h,
      the user will get the CBLAS function prototypes and enums, as expected.

commit 52f9e6f1b6468785af8947317656445d4729fc8b
Merge: ab57b979 21360dd8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Dec 1 12:28:09 2017 -0600

    Merge branch 'rt'

commit 21360dd8e2c7287100645e109acaabcc6ba1140c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 29 14:11:34 2017 -0600

    Fixed cntx_t packm query when ker_id > _NUM_PACKM_KERS.
    
    Details:
    - Fixed a subtle bug in bli_cntx_get_[un]packm_ker_dt() in which the
      function fails to return NULL when passed a kernel id argument that is
      equal to or beyond BLIS_NUM_[UN]PACKM_KERS. Instead, the function was
      attempting to index into the cntx_t's packm kernel array, which resulted
      in undefined behavior. Thanks to Devangi Parikh for finding this bug.
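    
    A minimal sketch of this kind of guarded lookup, under assumed names
    (my_cntx_t, my_cntx_get_packm_ker(), and the enum values are
    hypothetical, not the BLIS definitions): the kernel id is range-checked
    before it is used as an array index, and NULL is returned otherwise.

```c
#include <stddef.h>

typedef void (*kernel_fp)( void );

/* Hypothetical kernel ids; the last enumerator doubles as the count. */
typedef enum { KER_2XK = 0, KER_4XK, KER_8XK, NUM_PACKM_KERS } packm_ker_id_t;

/* Hypothetical context holding one function pointer per kernel id. */
typedef struct { kernel_fp packm_kers[ NUM_PACKM_KERS ]; } my_cntx_t;

/* Placeholder kernel used only to populate the table. */
void my_dummy_kernel( void ) { }

/* Return the registered kernel, or NULL if ker_id is out of range.
   Taking the id as unsigned means a single comparison also rejects
   values that would be negative as a signed integer. */
kernel_fp my_cntx_get_packm_ker( unsigned int ker_id, const my_cntx_t* cntx )
{
    if ( ker_id >= NUM_PACKM_KERS ) return NULL;

    return cntx->packm_kers[ ker_id ];
}
```

    Callers must still handle a NULL return, which can mean either an
    out-of-range id or a slot for which no kernel was registered.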

commit 244a6f4e66e8ff091e995f8090ce779c1928aa8b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Nov 28 17:48:48 2017 -0600

    Fixed POSIX sed non-compliance in flatten-header.sh.
    
    Details:
    - Changed GNU usage of 'i' and 'a' sed commands used in flatten-header.sh
      to POSIX-compliant usage that will work on OS X's sed.

commit 45078621676833e53a2878af8f89479c4f93b8ab
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Nov 28 15:16:22 2017 -0600

    Generate/compile with/install monolithic blis.h.
    
    Details:
    - Rewrote monolithify-header.sh (and renamed to flatten-header.sh) so that
      headers are inserted recursively. This improves performance by a factor
      of 3-4x.
    - Modified configure to create an 'include/<configname>' directory in which
      make can create a monolithic header.
    - Modified the top-level Makefile so that a monolithic header is generated
      unconditionally prior to compilation (stored in include/<configname>) and
      so that the single header is installed instead of the 450 or so header
      files that reside throughout the framework source tree.
    - Added "include/*/*.h" to .gitignore file.
    - Removed some pnacl/emscripten leftovers that I intended to include in
      a1caeba (mostly in testsuite/Makefile).
    - Trivial comment changes to frame/include/bli_f2c.h.

commit 1f30b1301bf6d6047ec29e57a5fde8eb1072a0ee
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Nov 25 16:54:26 2017 -0600

    Added missing framework support for x86_64 family.
    
    Details:
    - Added support for the x86_64 configuration family to bli_arch.c and
      bli_arch_config.h. Thanks to Johannes Dieterich for reporting this
      issue.
    - Bumped the default value for BLIS_SIMD_NUM_REGISTERS from 16 to 32 and
      the default value for BLIS_SIMD_SIZE from 32 to 64. This will support
      configuration families that include Skylake and newer processors without
      any support needed in the bli_family_*.h file. The semantics of these
      values have always been "maximum" and not exact values; comments in
      bli_kernel_macro_defs.h and the github wiki have been adjusted
      accordingly.

commit 9f39806c4ed484c9ed13edf96005838d977722a9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Nov 21 16:03:56 2017 -0600

    Fixed a bug in e31f0b3/b131b9a.
    
    Details:
    - Erroneously placed the "don't overwrite existing blocksize" logic in
      bli_blksz_init*() rather than in bli_cntx_set_blkszs(). It belongs in
      the latter because that function copies blocksizes as-is from the
      blksz_t function argument to the appropriate field in the cntx_t. If
      the blksz_t was previously initialized selectively, based on the sign
      of the blocksize value passed into bli_blksz_init*(), that just leaves
      some fields possibly uninitialized (with garbage values), which
      definitely will not work.
    - The aforementioned logic has been moved to bli_cntx_set_blkszs() via
      a new function bli_blksz_copy_if_pos(), which selectively copies only
      the blocksizes that are greater than zero.
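    
    A minimal sketch of the selective copy described above. The struct
    layout and the names my_blksz_t and my_blksz_copy_if_pos() are
    assumptions for illustration, not the BLIS definitions.

```c
#define MY_NUM_FP_TYPES 4 /* e.g. s, d, c, z */

/* Hypothetical blocksize object: one default and one maximum
   (extended) value per floating-point datatype. */
typedef struct
{
    long v[ MY_NUM_FP_TYPES ]; /* default blocksizes            */
    long e[ MY_NUM_FP_TYPES ]; /* maximum (extended) blocksizes */
} my_blksz_t;

/* Copy only the positive entries of src into dst. Entries of src that
   are zero or negative leave the corresponding dst entry untouched. */
void my_blksz_copy_if_pos( const my_blksz_t* src, my_blksz_t* dst )
{
    for ( int dt = 0; dt < MY_NUM_FP_TYPES; ++dt )
    {
        if ( src->v[ dt ] > 0 ) dst->v[ dt ] = src->v[ dt ];
        if ( src->e[ dt ] > 0 ) dst->e[ dt ] = src->e[ dt ];
    }
}
```

    Because the source object is always fully initialized and the filter
    is applied at copy time, no field is ever left holding garbage.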

commit b131b9a025c15f548d4c2952a9ec85eee3d139b1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Nov 21 14:30:26 2017 -0600

    Updated configs to omit setting some blocksizes.
    
    Details:
    - Employ the new semantics of bli_blksz_init*() in e31f0b3 in various
      sub-configurations' bli_cntx_init_*() functions by passing in 0 for
      register and cache blocksizes that correspond to gemm microkernel
      datatypes that were not registered, allowing the default values
      set by the bli_cntx_init_*_ref() function call to remain.

commit 499a4c002f895744ecaf81ef7f62d2d6d0d7d594
Merge: e31f0b3e 6c3ba502
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Nov 21 14:25:08 2017 -0600

    Merge branch 'rt' of github.com:flame/blis into rt

commit e31f0b3e2dba19ca8a2946bc21beb136a42d0f57
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Nov 21 14:21:25 2017 -0600

    Subtle update to bli_blksz_init*() API.
    
    Details:
    - Updated the semantics of bli_blksz_init() and bli_blksz_init_ed() so
      that non-positive blocksize values are ignored entirely. This provides
      an easy way to indicate that certain existing values should not be
      touched by the update. Thanks to Devangi Parikh for feedback that led
      to these changes.

commit 6c3ba502a11f87bc67555d26154cfd39d0af1bac
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Nov 21 13:50:53 2017 -0600

    Added 'x86_64' sub-config directory.
    
    Details:
    - Added missing x86_64 configuration directory, which was intended to be
      part of b7ca580.
    - Added -Wfatal-errors compiler warning flag to all configurations so that
      compilation stops after the first error.
    - Changed the vectorization flags for intel64 configuration to be compatible
      with 'penryn', the oldest sub-config included in that family.
    - Changed the vectorization flags for penryn to target the 'core2'
      microarchitecture and ssse3.

commit 25eee3cc49b0631812485d4d5ceef0c23ed1b6dd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Nov 21 12:34:20 2017 -0600

    Added a dummy file to kernels/generic.
    
    Details:
    - Added a dummy file to kernels/generic, which was previously empty, so
      that git would begin tracking the otherwise-empty directory. This
      directory's existence is necessary for proper execution of configure
      for any configuration family that contains the 'generic'
      sub-configuration. Thanks to Johannes Dieterich for reporting the
      issue that led to this fix.

commit ef024ce4cafa217669eaabb31ff8ab6df93cca05
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 20 18:08:29 2017 -0600

    More tweaks to monolithify-header.sh
    
    Details:
    - Further fixes to the monolithify-header.sh script.
    - Removed unnecessary #include "blis.h" from frame/3/bli_l3_packm.h.

commit 5028e7dec269b62895511453272585da36e591b5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 20 17:00:37 2017 -0600

    Second attempt to implement travis_wait.
    
    Details:
    - Corrected accidental misplacement of the travis_wait prefix (on the
      wrong line of the .travis.yml file) in commit 13e5d91.

commit 13e5d9107b3763cba46fb1bae87476852601b47c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 20 15:57:06 2017 -0600

    Added travis_wait prefix to testsuite via Travis.
    
    Details:
    - It appears that Travis CI has implemented a new policy that results in
      a test failing if it does not produce any output for more than 10
      minutes. (Two test instances are now failing in Travis despite the most
      recent commit not affecting the library or testsuite.) This issue can
      be worked around by executing the test run via travis_wait, which takes
      an optional time parameter. This commit attempts to use 'travis_wait 30'
      in the .travis.yml file to prevent the early failure at 10 minutes.

commit a1caeba0ea79c8fecb1abadca1f91c6367ab3afb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 20 13:31:20 2017 -0600

    Removed pnacl, emscripten support from Makefile.

commit 78199c539beaa50f37893add220261ce0dcb921a
Merge: b3d8ab2e ab57b979
Author: praveeng <praveen.g@amd.com>
Date:   Mon Nov 20 15:51:20 2017 +0530

    Merge master code till 01-Nov-2017 to amd-staging
    
    Change-Id: I40b53f876db84c8b947b3f2385c9b882245c6603

commit 9df6dda9ec51a0d40166169d2d8a2f84b42266e6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Nov 18 19:03:26 2017 -0600

    Improvements, bugfixes to monolithify-header.sh.

commit 21d26201f90b884eb8d5de279ed74bbd244ffcb5
Merge: 43baa3b3 b7ca5806
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Nov 18 14:16:53 2017 -0600

    Merge branch 'rt' of github.com:flame/blis into rt

commit 43baa3b327d5ae1e2ba619432687b4dd849b05e3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Nov 18 14:14:44 2017 -0600

    Removed unnecessary flags for generic config.
    
    Details:
    - Removed -D_POSIX_C_SOURCE=200112L and -m64 flags from make_defs.mk file
      of generic sub-configuration. These flags are generally not necessary,
      and particularly not desirable for the generic configuration since they
      unnecessarily restrict the environments in which the configuration can
      be built.

commit b7ca580618f9382b7982168fd035ed058f83e4c2
Author: iotamudelta <dieterich@ogolem.org>
Date:   Sat Nov 18 14:56:05 2017 -0500

    [WIP] Add x86 and x86_64 processor families. (#154)
    
    * Add x86 and x86_64 processor families.
    * Use generic config as fallback for more families.
    
    After discussion with fgvanzee: a) it's "generic", and b) use it for all
    the families as a fallback. The goal is that if a specific CPU is not yet
    supported by a family (say, a new Intel microarchitecture on x86_64),
    it'll fall through and still work with the slower "generic" kernels.

commit 870597d1663aaba1b74d7654b1d4946280aa0d3f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Nov 17 17:06:42 2017 -0600

    Added bash script for creating monolithic headers.
    
    Details:
    - Added a new script, monolithify-header.sh, to the 'build' directory.
      This script recursively replaces all #include directives in a selected
      file with the contents of the header files referenced by each directive.
      The idea is to "flatten" a tree of .h files into a single file, with
      the script acting as a C preprocessor that only processes #include
      directives.

commit c76f77f4cc1e71988251c5e63cf6ef137477bf9c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Nov 17 15:10:52 2017 -0600

    Removed unnecessary #include "blis.h" from header.
    
    Details:
    - Removed an errant #include "blis.h" directive from bli_cntx_ind_stage.h.
      The general policy is that no header file in BLIS should include
      blis.h. This will be important in the near future when using a tool to
      recursively create a monolithic blis.h file from its constituent
      headers.

commit 2bb9bc6e9536fa239fbc19a7efaaf151116e15b4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Nov 17 13:50:14 2017 -0600

    Miscellaneous tweaks to gks, rt functionality.
    
    Details:
    - Updated bli_cpuid_query_id() so that BLIS_ARCH_GENERIC is always returned
      if the hardware fails to test positive for any supported sub-configuration.
    - Defined bli_gks_init_ref_cntx(), which will call the context initialization
      function bli_cntx_init_configname() for the sub-configuration 'configname'
      associated with the arch_t id returned by bli_arch_query_id(). This makes
      initializing a reference context easy for experts who wish to construct
      those contexts.

commit b3d8ab2ea02c127ab241532abc214624f35bfaab
Merge: 189ffbb0 fe71c06e
Author: Santanu Thangaraj <Santanu.Thangaraj@amd.com>
Date:   Wed Nov 15 01:33:12 2017 -0500

    Merge "Added AMD copyright line to the changed files in last 3 commits" into amd-staging

commit fe71c06e42b072407c83112779055b0afb67173d
Author: Nisanth M P <nisanth.padinharepatt@amd.com>
Date:   Wed Nov 15 11:11:17 2017 +0530

    Added AMD copyright line to the changed files in last 3 commits
    
    Change-Id: I37d5dbbbe1b199e07529610a5e9cc9e49d067c66

commit d5bf79e50bf97072bbe7117c86b7c45e6e707ea0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 13 14:24:29 2017 -0600

    Miscellaneous tweaks and fixes.
    
    Details:
    - Fixed incorrect calling sequence in bli_cntx_init_knl.c--an instance of
      bli_blksz_init_easy() that should have been bli_blksz_init().
    - Fixed a bug in code that is supposed to output the list of sub-directories
      in the 'config' directory when configure script is run with no arguments.
    - Expanded the output of "make showconfig" to include more info from config.mk.
    - Minor changes to build/auto-detect/cpuid_x86.c, mostly in preparation for
      someone to add excavator and zen support.
    - Added a link to the ConfigurationHowTo wiki to config_registry.
    - Other minor tweaks to configure.

commit 673e5184030532c4ebd9fdeecbaa6442bb3ad54f
Merge: 2c51356a 8f150f28
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 1 17:37:42 2017 -0500

    Merge branch 'rt' of github.com:flame/blis into rt

commit 2c51356a8b2699c99f9507c80d69c08a35d45fe3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 1 17:37:02 2017 -0500

    Implemented runtime hardware detection via cpuid.
    
    Details:
    - Added runtime support for selecting an appropriate arch_t value based
      on the results of the cpuid instruction (for x86_64). This allows
      deferral of choosing a context (kernels, blocksizes, etc.) until
      runtime, which allows BLIS to be built with support for multiple
      microarchitectures. Currently, only amd64 and intel64 configurations
      are registered in the config_registry; however, one could create
      custom configuration families to support arbitrary sets of x86_64
      microarchitectures.
    - Current Intel microarchitectures supported via cpuid are knl, haswell,
      sandybridge, and penryn.
    - Current AMD microarchitectures supported via cpuid are: zen, excavator,
      steamroller, piledriver, and bulldozer.

commit ab57b979046479bcda7f83165838a80117c2ad95
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 1 11:51:41 2017 -0500

    Revert to default SIMD alignment for bulldozer.
    
    Details:
    - Removed the default-overriding #define of BLIS_SIMD_ALIGN_SIZE set in
      config/bulldozer/bli_kernel.h. Not sure where this value came from, but
      it would seem to allow for insufficient starting address alignment for
      any matrices created via bli_malloc_user(), such as via
      bli_obj_create(). Thanks to Rene Sitt for reporting the behavior that
      led us to this bug.
    - This commit is a manual patch of the same fix made to the 'rt' branch
      in 8f150f2.

commit 8f150f28a678c4a0c1591400177ad7cca81fcaec
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 1 11:41:45 2017 -0500

    Revert to default SIMD alignment for bulldozer.
    
    Details:
    - Removed the default-overriding #define of BLIS_SIMD_ALIGN_SIZE set in
      bli_family_bulldozer.h. Not sure where this value came from, but it
      would seem to allow for insufficient starting address alignment for
      any matrices created via bli_malloc_user(), such as via
      bli_obj_create(). Thanks to Rene Sitt for reporting the behavior that
      led us to this bug.

commit e3f10557caf114441fbfff990e3ce3576c177bdc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 30 13:37:54 2017 -0500

    Use perl for some substitution for OS X compatibility.
    
    Details:
    - Discovered that sed commands where the replacement string contains '\n'
      are problematic with the version of sed present in OS X. For these
      cases in the configure script, we instead use 'perl -pe' for
      search-and-replace functionality.
    - Various other minor comment/whitespace tweaks to configure.
    - Removed remaining lines of code related to setting/checking variables to
      track "unregistered" configurations.

commit dd45cfdfc3d8f9acf4cf7f69138d9b83dafc8842
Merge: 3e4f42a4 f60c827b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 30 12:23:05 2017 -0500

    Merge branch 'master' into rt

commit f60c827ba95f452c8454fb914f5564f4895bf644
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Oct 30 10:04:42 2017 -0500

    Fix CVECFLAGS for bulldozer config.

commit 3e4f42a4d2ebb37b95988933d92e561c5b2cc201
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 27 11:41:37 2017 -0500

    Typecast l1mkr_t enum value prior to comparison.
    
    Details:
    - Typecast l1mkr_t enum value in bli_cntx.h to guint_t before testing for
      out-of-range value. This is an attempt to pacify a strange warning from
      clang on OS X that is seemingly the result of the following compiler
      warning flag:
        -Wtautological-constant-out-of-range-compare

commit aec6e038d942d35b81bbd723a640cce2c054fb8e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 26 16:12:36 2017 -0500

    Removed associative arrays from configure.
    
    Details:
    - Implemented a replacement for associative arrays in the configure script
      that does not utilize arrays, and therefore works in pre-4.0 versions of
      bash. (It appears that Mac OS X will be stuck with version 3.2 indefinitely
      due to bash switching to the GPL 3.0 license starting with version 4.0.)

commit 189ffbb0d37262b21acddc0d35b4a22f2cbbca94
Merge: 06e0e635 3eb44f67
Author: Santanu Thangaraj <Santanu.Thangaraj@amd.com>
Date:   Wed Oct 25 02:00:30 2017 -0400

    Merge changes Ie115b206,I7ce6cfa2,Iff59b6f4 into amd-staging
    
    * changes:
      Adding __attribute__((constructor/destructor)) for CLANG case.
      Thread Safety: Move bli_init() before and bli_finalize() after main()
      Thread safety: Make the global induced method status array local to thread

commit 3eb44f67618b91ae5f5f0aaaba67e38f16042ee4
Author: Nisanth M P <nisanth.padinharepatt@amd.com>
Date:   Tue Oct 24 16:36:36 2017 +0530

    Adding __attribute__((constructor/destructor)) for CLANG case.
    
    CLANG supports __attribute__, but its documentation doesn't
    mention support for constructor/destructor. Compiling with
    clang and testing shows that it does support this.
    
    Change-Id: Ie115b20634c26bda475cc09c20960d687fb7050b

commit 07c352188bf5265af242255f8e6fcb97050d973d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 23 16:59:22 2017 -0500

    Added "generic" configuration.
    
    Details:
    - Added a "generic" configuration that leaves the default blocksizes and
      kernels unchanged. This replaces the older "reference" configuration.
      Updated auto-detect script and code accordingly.
    - Added support for generic configuration to arch_t (bli_type_defs.h),
      bli_gks_init() (bli_gks.c), and bli_arch_config.h
    - Moved bli_arch_query_id() to bli_arch.c (and prototype to bli_arch.h).
    - Whitespace changes to configurations' make_defs.mk files.

commit c1a98d6f70608b02a1e6bcad6ba020a60773dace
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 23 14:24:41 2017 -0500

    Minor update to .travis.yml file.

commit 75b9383f01caa8b83f8be0117e15085b0d807ba6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 20 16:41:22 2017 -0500

    Minor header renaming ahead of bli_arch.c.
    
    Details:
    - Renamed the various configurations' "bli_arch_<configname>.h" header files
      (replacing "arch" with "family") to free up the 'bli_arch' namespace for a
      different purpose (hardware detection).
    - Renamed "bli_arch.h" and "bli_arch_pre_macro_defs.h" in frame/include to
      "bli_arch_config.h" and "bli_arch_config_pre.h", respectively.

commit 482af51add26d5ed103c3e3f167657f273b32c7a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 20 15:44:26 2017 -0500

    Fixed 'make test' target from top-level Makefile.
    
    Details:
    - Updated the top-level Makefile's build rule for testsuite object files to
      properly obtain CFLAGS via get-frame-cflags-for() function instead of
      simply using the $(CFLAGS) variable (which is empty). This means that
      'make test' should now work as expected.

commit 3c269f700d207efe6c04193f09d519c88c1d4045
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 20 13:57:21 2017 -0500

    Makefile updates for test drivers, testsuite.
    
    Details:
    - Fixed semi-broken testsuite Makefile and very-broken test driver Makefiles,
      as well as those for test/3m4m, test/thread_ranges, and test/exec_sizes
      sub-directories.
    - Factored out much of the top-level Makefile into common.mk. A Makefile
      needs only set DIST_PATH to the relative path to the top level of the
      BLIS source distribution before including common.mk in order to acquire
      all of the definitions typically needed in a Makefile that tests BLIS.

commit 0557189d463446b4c32077cdcf0467fa71ca68dc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 18 15:05:27 2017 -0500

    Minor updates to .travis.yml, configure script.

commit 2553734d1d62043793f4e783a027349ef6d4d563
Merge: 453deb29 37534279
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 18 13:46:50 2017 -0500

    Merge branch 'master' into rt

commit 375342799cbae981c28d831793af588d7951f3f6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 18 13:41:25 2017 -0500

    Removed a duplicate bli_avx512_macros.h header.
    
    Details:
    - Removed a duplicate header file that was causing problems during
      installation for the 'knl' configuration. Thanks to Victor Eijkhout
      for reporting this issue.

commit 453deb29068889698e274f269c9aa90eea99b527
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 18 13:29:32 2017 -0500

    Implemented runtime kernel management.
    
    Details:
    - Reworked the build system around a configuration registry file, named
      'config_registry', that identifies valid configuration targets, their
      constituent sub-configurations, and the kernel sets that are needed by
      those sub-configurations. The build system now facilitates the building
      of a single library that can contain kernels and cache/register
      blocksizes for multiple configurations (microarchitectures). Reference
      kernels are also built on a per-configuration basis.
    - Updated the Makefile to use new variables set by configure via the
      config.mk.in template, such as CONFIG_LIST, KERNEL_LIST, and KCONFIG_MAP,
      in determining which sub-configurations (CONFIG_LIST) and kernel sets
      (KERNEL_LIST) are included in the library, and which make_defs.mk files'
      CFLAGS (KCONFIG_MAP) are used when compiling kernels.
    - Reorganized 'kernels' directory into a "flat" structure. Renamed kernel
      functions into a standard format that includes the kernel set name
      (e.g. 'haswell'). Created a "bli_kernels_<kernelset>.h" file in each
      kernels sub-directory. These files exist to provide prototypes for the
      kernels present in those directories.
    - Reorganized reference kernels into a top-level 'ref_kernels' directory.
      This directory includes a new source file, bli_cntx_ref.c (compiled on
      a per-configuration basis), that defines the code needed to initialize
      a reference context and a context for induced methods for the
      microarchitecture in question.
    - Rewrote make_defs.mk files in each configuration so that the compiler
      variables (e.g. CFLAGS) are "stored" (renamed) on a per-configuration
      basis.
    - Modified bli_config.h.in template so that bli_config.h is generated with
      #defines for the config (family) name, the sub-configurations that are
      associated with the family, and the kernel sets needed by those
      sub-configurations.
    - Deprecated all kernel-related information in bli_kernel.h and transferred
      what remains to new header files named "bli_arch_<configname>.h", which
      are conditionally #included from a new header bli_arch.h. These files
      are still needed to set library-wide parameters such as custom
      malloc()/free() functions or SIMD alignment values.
    - Added bli_cntx_init_<configname>.c files to each configuration directory.
      The files contain a function, named the same as the file, that initializes
      a "native" context for a particular configuration (microarchitecture). The
      idea is that optimized kernels, if available, will be initialized into
      these contexts. Other fields will retain pointers to reference functions,
      which will be compiled on a per-configuration basis. These bli_cntx_init_*()
      functions will be called during the initialization of the global kernel
      structure. They are thought of as initializing for "native" execution, but
      they also form the basis for contexts that use induced methods. These
      functions are prototyped, along with their _ref() and _ind() brethren, by
      prototype-generating macros in bli_arch.h.
    - Added a new typedef enum in bli_type_defs.h to define an arch_t, which
      identifies the various sub-configurations.
    - Redesigned the global kernel structure (gks) around a 2D array of cntx_t
      structures (pointers to cntx_t, actually). The first dimension is indexed
      over arch_t and the inner dimension is the ind_t (induced method) for
      each microarchitecture. When a microarchitecture (configuration) is
      "registered" at init-time, the inner array for that configuration in the
      2D array is initialized (and allocated, if it hasn't been already). The
      cntx_t slot for BLIS_NAT is initialized immediately and those for other
      induced method types are initialized and cached on-demand, as needed. At
      cntx_t registration, we also store function pointers to cntx_init functions
      that will initialize (a) "reference" contexts and (b) contexts for use with
      induced methods. We don't cache the full contexts for reference contexts
      since they are rarely needed. The functions that initialize these two kinds
      of contexts are generated automatically for each targeted sub-configuration
      from cpp-templatized code at compile-time. Induced method contexts that
      need "stage" adjustments can still obtain them via functions in
      bli_cntx_ind_stage.c.
    - Added new functions and functionality to bli_cntx.c, such as for setting
      the level-1f, level-1v, and packm kernels, and for converting a native
      context into one for executing an induced method.
    - Moved the checking of register/cache blocksize consistency from being cpp
      macros in bli_kernel_macro_defs.h to being runtime checks defined in
      bli_check.c and called from bli_gks_register_cntx() at the time that the
      global kernel structure's internal context is initialized for a given
      microarchitecture/configuration.
    - Deprecated all of the old per-operation bli_*_cntx.c files and removed
      the previous operation-level cntx_t_init()/_finalize() invocations.
      Instead, we now query the gks for a suitable context, usually via
      bli_gks_query_cntx().
    - Deprecated support for the 3m2 and 3m3 induced methods. (They required
      hackery that I was no longer willing to support.)
    - Consolidated the 1e and 1r packm kernels for any given register blocksize
      into a single kernel that will branch on the schema and support packing
      to both formats.
    - Added the cntx_t* argument to all packm kernel signatures.
    - Deprecated the local function pointer array in all bli_packm_cxk*.c files
      and instead obtain the packm kernel from the cntx_t.
    - Added bli_calloc_intl(), which serves as the calloc-equivalent to
      bli_malloc_intl(). Useful when we wish to allocate and initialize to
      zero/NULL.
    - Converted existing cpp macro functions defined in bli_blksz.h, bli_func.h,
      bli_cntx.h into static functions.
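    
    A minimal sketch of the 2D context table described in the gks item
    above: the native slot is filled at registration and the other
    induced-method slots are filled and cached on first query. All names
    (my_gks, my_gks_register(), my_gks_query(), the enum values) are
    hypothetical, and the sketch omits locking and cleanup.

```c
#include <stdlib.h>

typedef enum { MY_ARCH_HASWELL = 0, MY_ARCH_GENERIC, MY_NUM_ARCHS } my_arch_t;
typedef enum { MY_NAT = 0, MY_3M1, MY_4M1, MY_NUM_IND } my_ind_t;

/* Hypothetical context; a real one would hold kernels and blocksizes. */
typedef struct { my_arch_t arch; my_ind_t method; } my_cntx_t;

/* Outer dimension: architecture. Inner dimension (allocated at
   registration): induced method. Static storage starts out all NULL. */
static my_cntx_t** my_gks[ MY_NUM_ARCHS ];

/* Register an architecture: allocate its inner array if needed and
   initialize the native context immediately. Returns 0 on success. */
int my_gks_register( my_arch_t id )
{
    if ( (unsigned int)id >= MY_NUM_ARCHS ) return -1;

    if ( my_gks[ id ] == NULL )
    {
        my_gks[ id ] = calloc( MY_NUM_IND, sizeof( my_cntx_t* ) );
        if ( my_gks[ id ] == NULL ) return -1;
    }

    if ( my_gks[ id ][ MY_NAT ] == NULL )
    {
        my_cntx_t* nat = malloc( sizeof( *nat ) );
        if ( nat == NULL ) return -1;
        nat->arch   = id;
        nat->method = MY_NAT;
        my_gks[ id ][ MY_NAT ] = nat;
    }

    return 0;
}

/* Query a context. Induced-method contexts are derived from the native
   context on first use and cached for later queries. */
my_cntx_t* my_gks_query( my_arch_t id, my_ind_t ind )
{
    if ( (unsigned int)id  >= MY_NUM_ARCHS ) return NULL;
    if ( (unsigned int)ind >= MY_NUM_IND   ) return NULL;
    if ( my_gks[ id ] == NULL ) return NULL; /* never registered */

    if ( my_gks[ id ][ ind ] == NULL )
    {
        my_cntx_t* cntx = malloc( sizeof( *cntx ) );
        if ( cntx == NULL ) return NULL;
        *cntx        = *my_gks[ id ][ MY_NAT ]; /* start from native */
        cntx->method = ind;
        my_gks[ id ][ ind ] = cntx;
    }

    return my_gks[ id ][ ind ];
}
```

    A multithreaded version would need to guard the lazy initialization
    in my_gks_query(), for example with a mutex.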

commit 4607aac297e55ad540cbe5fffbe02e6b1889c181
Author: Nisanth M P <nisanth.padinharepatt@amd.com>
Date:   Mon Oct 16 22:06:57 2017 +0530

    Thread Safety: Move bli_init() before and bli_finalize() after main()
    
    BLIS provides APIs to initialize and finalize its global context.
    One application thread can finalize BLIS, while other threads
    in the application are still using BLIS.
    
    This issue can be solved by removing bli_finalize() from API.
    One way to do this is by getting bli_finalize() to execute by default
    after application exits from main().
    
    GCC supports this behaviour with the help of __attribute__((destructor))
    added to the function that needs to be executed after main exits.
    
    Similarly bli_init() can be made to run before application enters main()
    so that application need not call it.
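    
    A minimal sketch of the mechanism, assuming a GCC- or clang-compatible
    compiler. The names my_lib_init(), my_lib_finalize(), and
    my_lib_is_initialized() are hypothetical stand-ins for the library's
    own initialization and finalization routines.

```c
/* Hypothetical library state that must be set up before first use. */
static int my_lib_initialized = 0;

/* Runs automatically before main() is entered. */
__attribute__((constructor))
static void my_lib_init( void )
{
    my_lib_initialized = 1;
}

/* Runs automatically after main() returns or exit() is called. */
__attribute__((destructor))
static void my_lib_finalize( void )
{
    my_lib_initialized = 0;
}

/* Lets the application observe the state without calling init itself. */
int my_lib_is_initialized( void )
{
    return my_lib_initialized;
}
```

    By the time main() starts, my_lib_is_initialized() already returns 1,
    so no application thread needs to call an init or finalize routine.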
    
    Change-Id: I7ce6cfa28b384e92c0bdf772f3baea373fd9feac

commit 0f5ce26fc597cda6e8ae93a7526f52eb8cba01e9
Author: Nisanth M P <nisanth.padinharepatt@amd.com>
Date:   Mon Oct 16 21:07:50 2017 +0530

    Thread safety: Make the global induced method status array local to thread
    
    BLIS retains a global status array for induced methods, and provides
    APIs to modify this state during runtime. So, one application thread
    can modify the state, before another starts the corresponding
    BLIS operation.
    
    This patch solves this issue by making the induced method status array
    local to threads.
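    
    A minimal sketch of a per-thread status table, assuming a C11 compiler
    that supports _Thread_local. The commit text does not say which
    mechanism the patch itself uses, and the names and table dimensions
    here are hypothetical.

```c
#include <stdbool.h>

#define MY_NUM_IND 3 /* hypothetical number of induced methods */
#define MY_NUM_OPS 2 /* hypothetical number of operations      */

/* Each thread gets its own zero-initialized copy of this table, so a
   change made by one thread is never visible to another. */
static _Thread_local bool my_ind_enabled[ MY_NUM_IND ][ MY_NUM_OPS ];

/* Enable or disable a method for an operation in the calling thread.
   Out-of-range indices are ignored. */
void my_ind_set( int method, int op, bool status )
{
    if ( method < 0 || method >= MY_NUM_IND ) return;
    if ( op     < 0 || op     >= MY_NUM_OPS ) return;

    my_ind_enabled[ method ][ op ] = status;
}

/* Query the calling thread's status; out-of-range indices read as false. */
bool my_ind_get( int method, int op )
{
    if ( method < 0 || method >= MY_NUM_IND ) return false;
    if ( op     < 0 || op     >= MY_NUM_OPS ) return false;

    return my_ind_enabled[ method ][ op ];
}
```

    The trade-off is that a setting made in one thread no longer carries
    over to the others; each thread must configure its own copy.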
    
    Change-Id: Iff59b6f473771344054c010b4eda51b7aa4317fe

commit b882648af87deb1b365fc6b3e94151e69c5ccfa4
Merge: 8b379069 e02d3cb8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 11 16:32:21 2017 -0500

    Merge branch 'master' into rt

commit 06e0e6351acb9481225975ad9a4e0b8925336621
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Thu Sep 28 12:15:36 2017 +0530

    Inner-loop parallelization is turned off by default; the JR and IR loop parameters default to 1
    
    Change-Id: I8c3c2ecbbd636259f6ffb92768ec04148205c3e5

commit e02d3cb84190a345ebe9b32f53db03a1838976b1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 26 19:02:53 2017 -0500

    Fixed a pthread typo in previous commit.
    
    Details:
    - Misnamed 'pthread_mutex_t' type in bli_memsys.c as 'thread_mutex_t'.

commit f5962a1aae0fb3c9be104d0035c0d73210e7f670
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 26 17:00:04 2017 -0500

    Fixed bugs in gemm/gemmtrsm ukr tests in testsuite.
    
    Details:
    - Fixed a bug in gemmtrsm test module that was due to improper partitioning
      into a k x k triangular matrix for the purposes of obtaining an mr x k
      micropanel of A with which to test.
    - Fixed a bug in gemm and gemmtrsm test modules that would only manifest for
      very large k (depending on the product of mr x kc on that architecture).
      The bug arose from the fact that the test module was triggering the
      allocation of blocks from the internal memory pools, which are limited in
      size. This allocation imposes an implicit assumption that the micro-
      panel being tested with will fit inside, and this assumption is violated
      for large values of k. Arbitrarily large k may now be tested for both
      operation tests.
    - Added OpenMP/pthread critical sections around the setting or getting of
      statuses from the induced method operation lookup table in bli_l3_ind.c.
    - Added the 'static' keyword to all pthread_mutex_t global variables in BLIS.
    - Thanks to Nisanth Padinharepatt of AMD for reporting the first and third
      issues.

commit 8e917b256ca2d4bcdc059fe98d86be8775c69561
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Sep 9 14:10:15 2017 -0500

    Updated bibtex info for BLIS5 (3m4m) article.

commit 7be887057358df4978a4833eeae0c17e15acd9d1
Author: Nisanth M P <nisanth.padinharepatt@amd.com>
Date:   Mon Aug 28 17:38:22 2017 +0530

    Merging "Adding auto hardware detection for Zen"
    
    Change-Id: Id450fb0c4f91a5cd5cbdc06970f4f9ed28dd8520

commit e056d810d16621891ead032603de0c2105cfc0f7
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Mon Aug 28 16:44:42 2017 +0530

    Bug fix for the testsuite build failing
    
    Change-Id: I7cd8c9d187387c48b2564e45cbfb8df985e93d77

commit 83796b7caf745fafc263e9e5e1bfcf5eff00c025
Merge: 8176f4e4 d1ee7762
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Mon Aug 28 05:23:28 2017 -0400

    Merge "Adding auto hardware detection for Zen" into amd-staging

commit d1ee776202b26874333af7a91b6d2686342c4c81
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Wed Aug 23 13:01:14 2017 +0530

    Adding auto hardware detection for Zen
    
    Change-Id: I40ce6705dd66b35000c4ccddffad1c5b65998caf

commit 8176f4e43872714b997f1a5f83056daadb0ff1a5
Merge: 12413018 adafe974
Author: praveeng <praveen.g@amd.com>
Date:   Mon Aug 28 12:21:16 2017 +0530

    Resolved conflicts in bli_gemm_front.c and LICENCE
    
    Change-Id: Id24ce53896d4c1c7ceccc3e004014a0ecceb5474

commit 57e1e5cd51e7ffe8612c96a20b6a041b55426ddb
Merge: f86ce54d d6ef56c6
Author: Nisanth M P <nisanth.padinharepatt@amd.com>
Date:   Tue Aug 22 17:07:44 2017 +0530

    Merge AMD authored changes

commit adafe974b4bc3fc0663bc2f6f4ce2fde71a97988
Merge: f86ce54d 7dc78b49
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Aug 15 15:17:21 2017 -0500

    Merge pull request #150 from devinamatthews/vzeroupper
    
    Add vzeroupper to Intel AVX kernels.

commit 7dc78b49f97e6b3cd6d72fcdc588ace534d0e700
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Aug 15 10:02:25 2017 -0500

    Add vzeroupper to Intel AVX kernels.

commit f86ce54d6f315006984534fe29e47a2deaacc9f5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 10 16:24:28 2017 -0500

    Removed trailing enum commas from bli_type_defs.h.
    
    Details:
    - Removed trailing commas from enums in bli_type_defs.h. Thanks to
      Erling Andersen for pointing out this inconsistency and suggesting
      the change.

commit 60a1eeb2317939d732b9eb6ff1e0d6d668c9a1e5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Aug 5 13:04:31 2017 -0500

    Added edge handling to _determine_blocksize_b().
    
    Details:
    - Added explicit handling of situations where i == dim to
      bli_determine_blocksize_b_sub(). This isn't actually needed by any
      current use case within BLIS, but handling the situation is nonetheless
      prudent. Thanks to Minh Quan for reporting this issue and requesting
      the fix.
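
The edge case can be illustrated with a simplified sketch of a backward blocksize query. This is not the BLIS routine: the function name is hypothetical, and the handling of the maximum (extended) blocksize is omitted. When moving backward, the partial edge block is assumed to be consumed first, so the remaining iterations all use the full algorithmic blocksize.

```c
/* Simplified, hypothetical sketch: i is the number of elements already
   processed, dim is the total dimension, b_alg is the algorithmic blocksize. */
long blocksize_backward_sketch(long i, long dim, long b_alg)
{
    long dim_left = dim - i;

    /* The i == dim case: nothing is left, so the block is empty.
       Without this check, 0 % b_alg == 0 would wrongly yield b_alg. */
    if (dim_left == 0) return 0;

    /* A nonzero remainder is the edge block, which is handled first. */
    long dim_at_edge = dim_left % b_alg;

    return dim_at_edge == 0 ? b_alg : dim_at_edge;
}
```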

commit b01c80829907d50ec79977fba8e7b53cfe7db80a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Aug 4 14:17:44 2017 -0500

    Fixed a minor bug in level-3 packm management.
    
    Details:
    - Fixed a bug in bli_l3_packm() that caused cntl_t-cached packed mem_t
      entries to be released and then re-acquired unnecessarily. (In essence,
      the "<" operands in the conditional that guards the
      release-and-reacquire code block simply needed to be swapped.) The bug
      should have only affected performance (rather than the computed result).
      Thanks to Minh Quan for identifying and reporting the bug.

commit 8b379069fcd4811669855b1248ece831f190dff6
Merge: 1f3a5819 05925dd5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 1 15:30:40 2017 -0500

    Merge branch 'master' into rt

commit 05925dd5d30e8f403bb671ce33029170d65ce7c0
Merge: 803bbef0 cecdc05d
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Aug 1 09:31:02 2017 -0500

    Merge pull request #146 from devinamatthews/master
    
    Change lsame_ signature to match lapacke.

commit cecdc05d2834786a84ff85775d3f99a958c0765a
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Jul 31 15:19:51 2017 -0500

    Change lsame_ signature to match lapacke.

commit 803bbef0a386dd0571ad389f69d55154dbfe3c50
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jul 29 20:17:05 2017 -0500

    Fixed pthreads compile bug with previous commit.
    
    Details:
    - Erroneously passed family parameter into l3int_t function despite
      that function not taking the parameter. Oops.

commit c63980f4ca750618f359031d0691289b1abf5146
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jul 29 14:53:39 2017 -0500

    Moved 'family' field from cntx_t to cntl_t.
    
    Details:
    - Removed the family field inside the cntx_t struct and re-added it to the
      cntl_t struct. Updated all accessor functions/macros accordingly, as well
      as all consumers and intermediaries of the family parameter (such as
      bli_l3_thread_decorator(), bli_l3_direct(), and bli_l3_prune_*()). This
      change was motivated by the desire to keep the context limited, as much
      as possible, to information about the computing environment. (The family
      field, by contrast, is a descriptor about the operation being executed.)
    - Added additional functions to bli_blksz_*() API.
    - Added additional functions to bli_cntx_*() API.
    - Minor updates to bli_func.c, bli_mbool.c.
    - Removed 'obj' from bli_blksz_*() API names.
    - Removed 'obj' from bli_cntx_*() API names.
    - Removed 'obj' from bli_cntl_*(), bli_*_cntl_*() API names. Renamed routines
      that operate only on a single struct to contain the "_node" suffix to
      differentiate with those routines that operate on the entire tree.
    - Added enums for packm and unpackm kernels to bli_type_defs.h.
    - Removed BLIS_1F and BLIS_VF from bszid_t definition in bli_type_defs.h.
      They weren't being used and probably never will be.

commit 07837395560d413a1ba828163b41186e21a7bcfe
Merge: ca1d1d85 ad8610b4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 21 16:49:48 2017 -0500

    Merge pull request #139 from Maratyszcza/emscripten
    
    Fix Emscripten builds

commit ad8610b4415cc7982804d74f9aba29875e9e2b6c
Merge: 8772a0b3 ca1d1d85
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 21 15:18:33 2017 -0500

    Merge branch 'master' into emscripten

commit ca1d1d8560c9ab1a7e3b0ac43ac70d08075bf904
Merge: b537b5bb 733faf84
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 21 09:49:50 2017 -0500

    Merge pull request #144 from devinamatthews/fix_atomics_on_bgq
    
    Add fallbacks to __sync_* or __c11_atomic_* builtins...

commit 733faf848dcc54834fcdfbb0185dc644978d8864
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu Jul 20 14:50:13 2017 -0500

    Clang can't make up its mind what to support.

commit 7425d0744d9e9cd29a887120e57c2b43ba287040
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu Jul 20 12:54:58 2017 -0500

    Add default #define for __has_extension.

commit b537b5bbe8cbee459a85bac11458498ae2bce4de
Merge: 1f1ec0db 7f41bb0a
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu Jul 20 10:58:39 2017 -0500

    Merge pull request #133 from devinamatthews/haswell-packdim
    
    Fix prefetching in haswell ukernel

commit 8823f91a14638ce6f4e45e67df03212bb61609d6
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu Jul 20 10:04:34 2017 -0500

    Add fallbacks to __sync_* or __c11_atomic_* builtins when __atomic_* is not supported. Fixes #143.

commit 1f1ec0db9380b87679d5c771c4594daa1cfc5f0d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 19 15:40:48 2017 -0500

    Updated ar option list used by all configurations.
    
    Details:
    - Dropped 'u' from the list of modifiers passed into the library archiver
      ar. Previously, "cru" was used, while now we employ only "cr". This
      change was prompted by a warning observed on Ubuntu 16.04:
    
        ar: `u' modifier ignored since `D' is the default (see `U')
    
      This caused me to realize that the default mode causes timestamps to be
      zero, and thus the 'u' option, which causes only changed object files to
      be inserted, is not applicable.

commit 5caaba2d61cbbc36d63102a0786ece28ff797f72
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 19 13:51:53 2017 -0500

    Added --force-version=STRING option to configure.
    
    Details:
    - Added an option to configure that allows the user to force an arbitrary
      version string at configure-time. The help text also now describes the
      usage information.
    - Changed the way the version string is communicated to the Makefile.
      Previously, it was read into the VERSION variable from the 'version' file
      via $(shell cat ...). Now, the VERSION variable is instead set in
      config.mk (via a configure-substituted anchor from config.mk.in).

commit 13175c5fb70fb6a378d5fff6ecede62e5ea6a1f6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 18 17:56:00 2017 -0500

    Updated openmp/pthread barriers with GNU atomics.
    
    Details:
    - Updated the non-tree openmp and pthreads barriers defined in
      bli_thrcomm_openmp.c and bli_thrcomm_pthreads.c to instead call a common
      implementation in bli_thrcomm.c, bli_thrcomm_barrier_atomic(). This new
      implementation goes through the same motions as the previous codes, but
      protects its loads and increments with GNU atomic built-ins. These atomic
      statements take memory ordering parameters that allow us to specify just
      enough constraints for the barrier to work as intended on weakly-ordered
      hardware. The prior implementation was only guaranteed to work on systems
      with strongly-ordered memory. (Thanks to Devin Matthews for suggesting
      this change and his crash-course in atomics and memory ordering.)
    - Removed 'volatile' from structs' barrier field declarations in
      bli_thrcomm_*.h.
    - Updated bli_thrcomm_pthread.? files to use renamed struct barrier fields
      consistent with that of the _openmp.? files.
    - Updated other bli_thrcomm_* files to rename "communicator" variables to
      simply "comm".
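
A sense-reversing barrier protected by GNU atomic built-ins, in the spirit of the first item above, can be sketched as follows. The struct, the function names, and the exact memory orderings are assumptions made for illustration; this is not the code of bli_thrcomm_barrier_atomic().

```c
/* Hypothetical barrier state; field names are illustrative only. */
typedef struct
{
    int n_threads;  /* number of threads that must arrive */
    int arrived;    /* threads that have arrived so far */
    int sense;      /* flips each time the barrier opens */
} barrier_sketch_t;

void barrier_sketch_init(barrier_sketch_t *b, int n_threads)
{
    b->n_threads = n_threads;
    b->arrived   = 0;
    b->sense     = 0;
}

void barrier_sketch_wait(barrier_sketch_t *b)
{
    /* Remember the sense value in effect when this thread arrived. */
    int orig_sense = __atomic_load_n(&b->sense, __ATOMIC_RELAXED);

    /* Atomically count this thread's arrival. */
    int arrived = __atomic_add_fetch(&b->arrived, 1, __ATOMIC_ACQ_REL);

    if (arrived == b->n_threads)
    {
        /* Last thread: reset the counter, then publish the flipped sense.
           The release ordering makes the reset visible to the waiters. */
        __atomic_store_n(&b->arrived, 0, __ATOMIC_RELAXED);
        __atomic_fetch_xor(&b->sense, 1, __ATOMIC_RELEASE);
    }
    else
    {
        /* Other threads: spin until the last thread flips the sense. */
        while (__atomic_load_n(&b->sense, __ATOMIC_ACQUIRE) == orig_sense) { }
    }
}
```

The acquire load in the spin loop pairs with the release update of the sense field, which is the kind of explicit constraint that weakly-ordered hardware needs and that plain loads and stores do not provide.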

commit 0e58ba1b3aa84700ca51a96f1c0eed6067562fba
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 17 19:03:22 2017 -0500

    Added API to set mt environment variables.
    
    Details:
    - Renamed bli_env_get_nway() -> bli_thread_get_env().
    - Added bli_thread_set_env() to allow setting environment variables
      pertaining to multithreading, such as BLIS_JC_NT or BLIS_NUM_THREADS.
    - Added the following convenience wrapper routines:
        bli_thread_get_jc_nt()
        bli_thread_get_ic_nt()
        bli_thread_get_jr_nt()
        bli_thread_get_ir_nt()
        bli_thread_get_num_threads()
        bli_thread_set_jc_nt()
        bli_thread_set_ic_nt()
        bli_thread_set_jr_nt()
        bli_thread_set_ir_nt()
        bli_thread_set_num_threads()
    - Added #include "errno.h" to bli_system.h.
    - This commit addresses issue #140.
    - Thanks to Chris Goodyer for inspiring these updates.

commit 8772a0b33a90154c80d88b381dcdd66f824e041f
Author: Marat Dukhan <marat@fb.com>
Date:   Thu Jul 13 21:39:24 2017 -0700

    Fix Emscripten builds

commit 72c8b49bb8d3b9370b2cc37718da22f065de9c57
Merge: 70cc825b ba7cada5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 12 14:58:12 2017 -0500

    Merge pull request #138 from hominhquan/membrk_set_free_fp
    
    Set missing free_fp in bli_membrk_init for free-ing GEN_USE buffers

commit ba7cada51a238d320528e3504ed0f0a17a6b022a
Author: Minh Quan HO <mqho@kalray.eu>
Date:   Fri Jul 7 10:52:05 2017 +0200

    set missing free_fp in bli_membrk_init for free-ing GEN_USE buffers
    
    The membrk's free_fp is called when releasing GEN_USE buffers, but this free_fp is
    not set in bli_membrk_init

commit 1241301869957c96f16a2c6567e3ad70afa547de
Merge: 969b67e8 25ead66f
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Wed Jul 5 02:24:00 2017 -0400

    Merge "Reducing the framework overhead of GEMV routines" into amd-staging

commit 25ead66fb78557f73af48bac305724d5d8aa3309
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Fri Jun 30 12:23:19 2017 +0530

    Reducing the framework overhead of GEMV routines
    
    Change-Id: I83607ad767bff74e305e915b54b0ea34ec3e5684

commit 969b67e8800fbd5d14a086606f3b5afbf66ed093
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Tue Jul 4 12:57:32 2017 +0530

    Improved efficiency of dGEMM for large matrices by reducing TLB load misses and, above all, L3 cache misses. This is achieved by changing the packed block sizes of matrices A and B. The optimal values are now MC_D = 510 and KC_D = 1024.
    
    Change-Id: I2d8bdd5f62f2d1f8782ae2997f3d7a26587d1ca4

commit 70cc825b552dec05165b9d70f9e6eb33d8abb118
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Jun 6 21:58:21 2017 -0500

    Update LICENSE
    
    Remove totally unnecessary first 9 lines and hopefully get Github to recognize it as 3BSD [ci skip].

commit cf54c77bc79a0f33a514be72c80a654c4e6e6f63
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Jun 6 20:23:17 2017 -0500

    Add new SSI acknowledgment

commit d6ef56c6dbaf6df8ee1af1ca6a0f0792a811396a
Author: prangana <pradeep.rao@amd.com>
Date:   Thu Jun 1 16:11:09 2017 +0530

    Update version number
    
    Change-Id: Ib6e52d1d34c0791367ab9152dfab31f94deedeb4

commit 897bfa0e92082c30bbb74229562d7d7327cbbac8
Author: prangana <pradeep.rao@amd.com>
Date:   Thu Jun 1 16:11:09 2017 +0530

    Update version number
    
    Change-Id: Ib6e52d1d34c0791367ab9152dfab31f94deedeb4

commit 99d0ba5606d4b63e6a9c639aa78d4defc2455f79
Merge: be2c7eb8 6d17e012
Author: Santanu Thangaraj <Santanu.Thangaraj@amd.com>
Date:   Thu Jun 1 02:19:02 2017 -0400

    Merge "Checked in the small matrix code to compute GEMM called with A transpose case" into amd-staging

commit 6d17e0120fe5c127b941136ad2c0c08e91439535
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Wed May 24 11:48:16 2017 +0530

    Checked in the small matrix code to compute GEMM called with A transpose case
    
    Change-Id: I29f40046d43d7a4b037c1cb322503ee26495f462

commit 9d93f8481a1404695f7b78a3ced8ca47e890b649
Author: prangana <pradeep.rao@amd.com>
Date:   Tue May 30 09:58:10 2017 +0530

    Update Licence File
    
    Change-Id: I4c5cf1690d0cef92a68400f9a89e454ab6856ad2

commit be2c7eb85168937bd4318f4d05ded37620119310
Author: prangana <pradeep.rao@amd.com>
Date:   Tue May 30 09:58:10 2017 +0530

    Update Licence File
    
    Change-Id: I4c5cf1690d0cef92a68400f9a89e454ab6856ad2

commit 7f41bb0a0becde6a7de7df0f99668d7b4686c3b0
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri May 26 14:49:31 2017 -0400

    PACKDIM_MR=8 didn't work out, but messing with the prefetching helps 2%.

commit d87614af3f3d9187be94d6e77984b282bf890928
Author: Devin Matthews <dmatthews@gator3.ufhpc>
Date:   Fri May 26 14:47:36 2017 -0400

    Revert "Change PACKDIM_MR (double) for haswell to 8."
    
    This reverts commit 681eec913d7c2ebcff637cec5c1627ced9a92b99.

commit 681eec913d7c2ebcff637cec5c1627ced9a92b99
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri May 26 12:28:09 2017 -0500

    Change PACKDIM_MR (double) for haswell to 8.

commit 0a3ae0ecaa0ddcb5887005d7051fa234499f1120
Merge: 0f4e6652 6e04f9df
Author: praveeng <praveen.g@amd.com>
Date:   Sat May 20 16:53:50 2017 +0530

    frame/3/gemm/bli_gemm_front.c
    
    Change-Id: I52a0fbc1d33bb948d430942323bbc5fe44e3ca13

commit 6e04f9df01d79c1b0e673943ca0d5d0a6095eb2e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 17 13:03:52 2017 -0500

    Restored deleted lines from makefile fragments.

commit ec5c0c0448275280dca0991f6f33afeb73650450
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed May 17 12:29:44 2017 -0500

    Change to /bin/sh.
    
    All scripts checked with Debian's checkbashisms. Also check for clang first in auto-detect.sh.

commit 555ddc30d4c7e44f3f335e436c98606f56e1598b
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed May 17 12:27:14 2017 -0500

    Remove shebangs from makefiles.

commit f26bd7f42e0c2a47fe321b2c452644990b689654
Merge: cbf8710a 169fb05f
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed May 17 11:58:41 2017 -0500

    Merge pull request #128 from iotamudelta/master
    
    Portability and clang

commit 169fb05f225c2f060265bcaa872f7f80dc638b70
Author: J M Dieterich <dieterich@ogolem.org>
Date:   Tue May 16 23:11:22 2017 -0400

    Fix if/else structure. Thanks to TravisCI.

commit 0579dfea0bcfbb90ebc073fcf78b92a5cf7238e1
Author: J M Dieterich <dieterich@ogolem.org>
Date:   Tue May 16 22:58:07 2017 -0400

    Restore version.

commit a75b05c23dc786a1fdc45dc1627a5ce2299f1a7b
Author: J M Dieterich <dieterich@ogolem.org>
Date:   Tue May 16 22:23:27 2017 -0400

    Mark piledriver compilable w/ clang.

commit 7541d46e2ba8659bb2e36b444edef112fefa1345
Author: J M Dieterich <dieterich@ogolem.org>
Date:   Tue May 16 22:12:12 2017 -0400

    Mark bulldozer compilable w/ clang.

commit 91f897073ec0df3330ede449c4d6af8158266ae3
Author: J M Dieterich <dieterich@ogolem.org>
Date:   Tue May 16 22:06:59 2017 -0400

    Correct error message.

commit f5131e1e49167f948bddd714bb1af1761829c212
Author: J M Dieterich <dieterich@ogolem.org>
Date:   Tue May 16 22:03:23 2017 -0400

    Indeed one can compile for carrizo also using clang.

commit 5fa4e9439c04f35f89dd7d26ff742cb2dadc3180
Author: J M Dieterich <dieterich@ogolem.org>
Date:   Tue May 16 21:50:49 2017 -0400

    A bunch of shebang fixes from unportable /bin/bash to portable /usr/bin/env bash

commit 1f3a58197e5d5f9ac862bda91e7527cbfbab5d76
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon May 8 16:10:03 2017 -0500

    Housekeeping, induced method file/function renames.
    
    Details:
    - Renamed all level-3 induced method files to use the "_vir.c" suffix
      instead of "_ref.c". Also renamed functions within these files
      accordingly.
    - Renamed cpp macro definitions in frame/ind/include according to the
      above changes.
    - Removed frame/3/old.

commit cbf8710a1ba63e25aadaa6fc5da51ea81b3d596d
Merge: cf39d3ef fdc66f12
Author: Tyler Michael Smith <tms@cs.utexas.edu>
Date:   Mon May 8 11:21:20 2017 -0500

    Merge pull request #127 from devinamatthews/fix_blis_nt_xx
    
    Setting any one of BLIS_NT_[IJ][CR] overrides BLIS_NUM_THREADS

commit cf39d3ef3b29b8058c39fb4638c1a734fe64aaed
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri May 5 15:06:56 2017 -0500

    Fixed a bug in norm1v, norm1m.
    
    Details:
    - Fixed a bug that manifested as improperly-computed 1-norm for vectors
      and matrices. This is one of the few operations in BLIS that does not
      have its own test module within the testsuite, hence why it went
      undetected for so long. The bad 1-norms were being used to normalize
      matrices in the testsuite after initialization, which led to some
      matrices containing a combination of "large" and "small" values. This
      tended to push the residuals computed after each test away from zero.
      In some cases, they were off *just* enough for the testsuite to label
      it a "failure". Many thanks to Jeff Hammond for reporting this bug.
      (Wonky details: the bug was due to improperly-defined level-0 scalar
      macros for abval2, an operation that computes the absolute square,
      or complex magnitude/modulus. Certain complex domain instances of
      abval2 were being incorrectly defined in terms of real-only solutions,
      leading to bad results. This level-0 operation forms the basis of
      norm1v/norm1m. absq2 was also affected, but almost nothing uses
      this operation.)

commit 799485124f4d823e908d2e5d38b0c3a1e6172ade
Merge: 773a24ef 0df3541f
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu May 4 10:52:09 2017 -0500

    Merge pull request #121 from jeffhammond/not-real-knl
    
    allow KNL build without hbwmalloc (i.e. emulated)

commit fdc66f12d40754ff46179804bff592fddafbca02
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu May 4 10:35:22 2017 -0500

    Setting any one of BLIS_NT_[IJ][CR] overrides BLIS_NUM_THREADS. Missing BLIS_NT_XX's are defaulted to 1. Fixes #123.
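
The precedence rule stated above can be sketched as a pure function. Everything here is hypothetical: the struct, the function names, and the convention that -1 means "unset". How a global thread count is factored across loops is not modeled.

```c
/* Hypothetical result type; -1 means "not set". */
typedef struct
{
    int jc, ic, jr, ir;  /* per-loop ways of parallelism */
    int num_threads;     /* global thread count, or -1 if overridden */
} nt_sketch_t;

/* Unset per-loop values default to 1. */
static int nt_or_one(int v)
{
    return v > 0 ? v : 1;
}

nt_sketch_t resolve_nt_sketch(int num_threads, int jc, int ic, int jr, int ir)
{
    nt_sketch_t r;

    if (jc > 0 || ic > 0 || jr > 0 || ir > 0)
    {
        /* Any explicit per-loop value overrides the global count. */
        r.jc = nt_or_one(jc);
        r.ic = nt_or_one(ic);
        r.jr = nt_or_one(jr);
        r.ir = nt_or_one(ir);
        r.num_threads = -1;
    }
    else
    {
        /* No per-loop value given: only the global count applies. */
        r.jc = r.ic = r.jr = r.ir = -1;
        r.num_threads = num_threads;
    }
    return r;
}
```

For example, a global count of 8 combined with only the IC value set to 2 resolves to 1 x 2 x 1 x 1 ways, and the global count is ignored.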

commit 773a24efb2fa1c3a220bf0ce1dd621a3176196da
Merge: dd58c954 b8854259
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 3 15:07:59 2017 -0500

    Merge branch 'master' of github.com:flame/blis

commit dd58c9545c877c3f7553eaebca7b5e9720a66f5d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 3 15:04:51 2017 -0500

    Disable complex 3m/4m in testsuite by default.
    
    Details:
    - Disabled testsuite tests of all level-3 implementations based on 3m
      and 4m. This will improve testing runtime on Travis CI as well as for
      anyone manually running the testsuite using default test parameters.
      Thanks to Devin Matthews for suggesting this change.

commit 0df3541f54b7fe0c604ab2ec47ba814f12391798
Author: Jeff Hammond <jeff.science@gmail.com>
Date:   Tue May 2 19:25:21 2017 -0700

    allow KNL build without hbwmalloc.h (i.e. emulated)
    
    we want to be able to run BLIS KNL binaries on non-KNL machines via SDE.
    although it is possible to install hbwmalloc implementation on such
    systems, it is easier not to, since obviously the performance of SDE
    execution is not representative so there is no reason to emulate HBW
    allocation.

commit b88542591d4dd0cde366e5ae35afd3205cb81bdc
Merge: 43007f7b c2c91e09
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 2 19:22:41 2017 -0500

    Merge pull request #107 from jeffhammond/intel-compilers-no-use-libm
    
    never use libm with Intel compilers

commit 43007f7b65ec7926cbbfc39965ff733fa251c15f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 2 16:48:43 2017 -0500

    Fixed stray parentheses in README citations.

commit a4f1d0b8801c114e9ef8be39df01e1b8d27ebcb3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 2 16:38:43 2017 -0500

    CHANGELOG update (0.2.2)

commit 940a707ac78de975110e17c95765e65b89aa5e10 (tag: 0.2.2)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 2 16:38:42 2017 -0500

    Version file update (0.2.2)

commit d5a5e003ea9b24bb6abf12e88862e8eb61ffb03d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 2 15:48:30 2017 -0500

    Fixed a trsm1m bug that affected right-side cases.
    
    Details:
    - Fixed a bug introduced in 1c732d3 that affected trsm1m_r. The result
      was nondeterministic behavior (usually segmentation faults) for certain
      problem sizes beyond the 1m instance of kc (e.g. 128 on haswell). The
      cause of the bug was my commenting out lines in bli_gemm1m_ukr_ref.c
      which explicitly directed the virtual gemm micro-kernel to use temporary
      space if the storage preference of the [real domain] gemm ukernel did
      not match the storage of the output matrix C. In the context of gemm,
      this handling is not needed because agreement between the storage pref
      and the matrix is guaranteed by a high-level optimization in BLIS.
      However, this optimization is not applied to trsm because the storage
      of C is not necessarily the same as the storage of the micro-panels of
      B--both of which are updated by the micro-kernel during a trsm
      operation. Thus, the guarantee of storage/preference agreement is not
      in place for trsm, which means we must handle that case within the
      virtual gemm micro-kernel.
    - Comment updates and a minor macro change to bli_trsm*_cntx_init() for
      3m1, 4m1a, and 1m.

commit e80993e71f4d571e9650a8e90ed386e32059eae5
Merge: a509fbd5 ca3a7924
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 2 12:30:28 2017 -0500

    Merge branch 'master' into 1m

commit ca3a7924770d6cf203cce4ca9f5482e1d0d4e961
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 2 12:09:39 2017 -0500

    README.md update.
    
    Details:
    - Updated bibtex entries for 4th BLIS paper, and added entries for 5th
      and 6th BLIS papers.

commit 0f4e6652dfe9b30105d3bab328ac26d9d5c11182
Merge: 42e7f6fb 6e7de6ef
Author: praveeng <praveen.g@amd.com>
Date:   Wed Apr 19 17:54:10 2017 +0530

    Merge master code till 2017_04_19 to amd-staging
    
    Change-Id: Ibebe83c8ea2e7eb15798c2bcf214b7228a1c9518

commit 42e7f6fb2a531429ee600b2fe0293b67371c7ccb
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Tue Mar 28 18:10:03 2017 +0530

    fixed license attribute issues in AMD added files
    
    Change-Id: I303f870a777c7cd1c1af29ea0b93f3e0a27948e4

commit 5600001e973c6cea048bd3fdb28117f1d7c98b9d
Merge: 0b190293 b3ed4933
Author: prangana <pradeep.rao@amd.com>
Date:   Mon Mar 20 13:56:33 2017 +0530

    Fix merge conflicts after sync with release branch
    
    Change-Id: Icf14a09f728befb69a73fff9fa79c4128e728310

commit 6e7de6ef84babb273dc5528a9b9d01f0febe394b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 17 12:10:24 2017 -0500

    Minor updates to test/3m4m.
    
    Details:
    - Updated initial problem size and increment in Makefile.
    - Updated code in test_gemm.c to correctly query kc from context.

commit f484c6cd4389dc7ae5b972849e12e98ad5bbf9a4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 17 12:07:27 2017 -0500

    Whitespace reformatting to armv8a kernels file.
    
    Details:
    - Updated formatting of function signature/header in
      kernels/armv8a/3/bli_gemm_opt_4x4.c.

commit 0b19029342ffc530fa22ef20398a26221cb8f6ec
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Tue Mar 14 14:51:31 2017 +0530

    Code cleanup, removed warnings from trsm, removed unused routines in axpyv & scalv
    
    Change-Id: I02867f394c5f416194c4b1769a6c75f39243ec81

commit 825363bd2a5a60a923d4a6d9691dc143845a9cab
Merge: 093bdb80 513944e4
Author: praveeng <praveen.g@amd.com>
Date:   Wed Mar 8 15:42:49 2017 +0530

    Merge code from master to amd-staging as on 2017_03_08 by praveeng
    
    Change-Id: I80740081b2cb54c9b77a3e78b9fe540e170be23d

commit 093bdb80c86b06367e595aa17487139ae983822f
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Tue Mar 7 13:35:50 2017 +0530

    Checked in Unpacked DGEMM code
    
    Change-Id: I39dcc7b238b328f73ee2675d21a5e521d0488723

commit 33923da9a108854590d386e74b6ee66b971e7796
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Mon Mar 6 14:31:31 2017 +0530

    Added variant 10 for double precision axpyv microkernel
    
    Change-Id: I7a20cc113a422603250bc450825c965136354974

commit bc828f7f8e3ddb9f58af07edc0b935b21759fb0f
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Fri Mar 3 14:45:35 2017 +0530

    Added new axpyv (single precision) microkernel that performs 10 FMAs per loop. This gives better performance than all other implementations of axpyv
    
    Change-Id: Ic4f0e4c67e367d67d0b24febcf34f81a70a39972

commit c9949f4603419267c10973adf1d63ec38497475d
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Fri Feb 17 14:16:33 2017 +0530

    Checked in DGEMMTRSM and edge case handling routine in DDOTXF
    
    Change-Id: I65f00661af6c09b2507294fd43e0a10641c0597e

commit a509fbd5ac04fafd4e51b43d2f59ca56432dc212
Merge: 69b4846a 513944e4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 21 17:06:16 2017 -0600

    Merge branch 'master' into 1m

commit 69b4846ae9adb157c4171b52e159684db2867853
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 21 15:33:39 2017 -0600

    Disabled experiment-related 1m code.
    
    Details:
    - Commented out code in frame/ind/oapi/bli_l3_3m4m1m_oapi.c that was
      specifically inserted to facilitate the benchmarking of 1m block-panel
      and panel-block algorithms.
    - Updates to test/3m4m/Makefile, runme.sh script, and test_gemm.c to
      reflect changes used/needed during benchmarking.

commit 513944e4a951d8823b4de161b86ad7a965b4d99b
Merge: 8b462a0e 0e18f68c
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Feb 20 10:04:33 2017 -0500

    Merge pull request #118 from devinamatthews/master
    
    Handle k=0 correctly in KNL dgemm ukernel.

commit 0e18f68cf12eb9189ba901a20040b1cdae417670
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Feb 20 09:03:21 2017 -0600

    Handle k=0 correctly in KNL dgemm ukernel.

commit 8b462a0e8c3e9252f0401940849e53cc772256fa
Merge: c362afc5 7d42fc07
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Sun Feb 19 23:03:03 2017 -0500

    Merge pull request #117 from devinamatthews/master
    
    Cast dim_t and inc_t parameters to 64-bit in KNL microkernels.

commit 7d42fc0796ef0c010375fd8e59b1240ba41ce4d2
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Sun Feb 19 21:10:55 2017 -0500

    Cast dim_t and inc_t parameters to 64-bit in KNL microkernels.

commit 04245c9ff7f8b3c70d61003029c964bb9a4320ee
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Fri Feb 10 14:24:30 2017 +0530

    Reoptimized scalv routines - two vector multiplies are done per iteration, and these routines are enabled in bli_kernel.h
    
    Change-Id: Ic5654508573d1f6bde2edef06aefe117e581feb5

commit c362afc525bab4050581d1b0fcea2fe4d582c608
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 9 11:54:59 2017 -0600

    Added missing "level-0" BLAS [sd]cabs1_().
    
    Details:
    - Fixed issue #115 by adding implementations for scabs1_() and dcabs1_()
      to the BLAS compatibility layer. Thanks to heroxbd for pointing out
      their absence.

commit 018180c938c32efbeaaf626ba71ec5b780664db1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 8 11:20:52 2017 -0600

    Fixed a minor bug in configure (issue #114).
    
    Details:
    - Fixed a bug in the configure script whereby a non-preferred value for
      --enable-threading would cause problems in common.mk vis-a-vis detecting
      which threading model was chosen. Thanks to heroxbd for reporting this
      issue.

commit 58b5b77e5fdb179ea465e398e416e6a00d917e05
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Wed Feb 8 21:43:34 2017 +0530

    Fixed a bug in axpyv: corrected the arguments passed to the intrinsic fmad instruction
    
    Change-Id: If12f24c6bc74b22ac9e4acd6b9378e06d79f2f5e

commit 85de4ebf74d0a5587d5a12724eb5489d51674db3
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Wed Feb 8 14:41:04 2017 +0530

    variant 4 axpyv single precision modified: explicitly used FMA intrinsics, replaced vector multiply and add operations
    
    Change-Id: I975feef56696d479d2b9e9441b0660021cf4f6ff

commit 3fa53e8af31d634779f40258c51483ae8af494fa
Merge: b5291a44 95be7b04
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Wed Feb 8 11:46:34 2017 +0530

    Merged axpyv and gemm small in bli_kernel.h
    Merge branch 'amd-staging' of ssh://git.amd.com:29418/cpulibraries/er/blis into amd-staging
    
            modified:   config/zen/bli_kernel.h
            modified:   frame/3/gemm/bli_gemm_front.c
            modified:   kernels/x86_64/zen/3/bli_gemm_small_matrix.c
    
    Change-Id: If181cf9345178c448b3530beb8bef453917fe295

commit 95be7b04709e688a4cb01fba680081e30f4258ef
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Tue Feb 7 14:01:27 2017 +0530

    Added logic for packing matrix A and prefetching matrix C in Unpacked SGEMM code
    
    Change-Id: I99efeca9eb5b4449286ec0ec133fd554ef1bb4f0

commit b5291a445b1313e01f1e0e8102c5f3660ab07f69
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Tue Feb 7 12:39:31 2017 +0530

    Added optimization variant 4 for axpyv single precision - this performs 5 FMA per loop, keeping the IPC always full
    
    Change-Id: Ie77ed22584271136a257e673bcd3b1ba71136bc9

commit f4bfc1662af82aa4b98185334c44835e51f1cbec
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Mon Feb 6 15:04:27 2017 +0530

    New routines implemented for axpyv to improve performance for small vector sizes. Vectorization is done for vectors as small as 8 (single precision) or 4 (double precision). Since this operation has a low compute-to-memory ratio, memory operations dominate at larger sizes and hence there is not much gain there. This still needs some work. Added saxpyv and daxpyv var 3 routines in the file bli_axpyv_opt_var1.c
    
    Change-Id: Ic1b33bd5516e10113b00e44ab41b97eb19d46072

commit ddf45e71770c55ea4a58ca24ea4913fe5d8beb9b
Merge: a6ab91bc 78e1b16e
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jan 27 14:25:40 2017 -0600

    Merge pull request #113 from devinamatthews/knl_thread_params
    
    Change default threading parameters for KNL.

commit 78e1b16e16d589ed31b2e712115ee282097f114d
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jan 27 14:22:20 2017 -0600

    Change default threading parameters for KNL.

commit 574472ba5a89924eca7dbd10055d0e1dcd7f4c71
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Tue Jan 10 14:51:46 2017 +0530

    checked in unpacked SGEMM optimization
    
    Change-Id: I8e4ea374415c0c402c660b656fb076af15354181

commit 1c732d3ddc4ac0861d3b0e0dd15eb7e071615502
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jan 25 16:25:46 2017 -0600

    Added 1m-specific APIs for bp, pb gemm algorithms.
    
    Details:
    - Defined bli_gemmbp_cntl_create(), bli_gemmpb_cntl_create(), with the
      body of bli_gemm_cntl_create() replaced with a call to the former.
    - Defined bli_cntl_free_w_thrinfo(), bli_cntl_free_wo_thrinfo(). Now,
      bli_cntl_free() can check if the thread parameter is NULL, and if so,
      call the latter, and otherwise call the former.
    - Defined bli_gemm1mbp_cntx_init(), bli_gemm1mpb_cntx_init(), both in
      terms of bli_gemm1mxx_cntx_init(), which behaves the same as
      bli_gemm1m_cntx_init() did before, except that an extra bool parameter
      (is_pb) is used to support both bp and pb algorithms (including to
      support the anti-preference field described below).
    - Added support for "anti-preference" in context. The anti_pref field,
      when true, will toggle the boolean return value of routines such as
      bli_cntx_l3_ukr_eff_prefers_storage_of(), which has the net effect of
      causing BLIS to transpose the operation to achieve disagreement (rather
      than agreement) between the storage of C and the micro-kernel output
      preference. This disagreement is needed for panel-block implementations,
      since they induce a transposition of the suboperation immediately before
      the macro-kernel is called, which changes the apparent storage of C. For
      now, anti-preference is used only with the pb algorithm for 1m (and not
      with any other non-1m implementation).
    - Defined new functions,
        bli_cntx_l3_ukr_eff_prefers_storage_of()
        bli_cntx_l3_ukr_eff_dislikes_storage_of()
        bli_cntx_l3_nat_ukr_eff_prefers_storage_of()
        bli_cntx_l3_nat_ukr_eff_dislikes_storage_of()
      which are identical to their non-"eff" (effectively) counterparts except
      that they take the anti-preference field of the context into account.
    - Explicitly initialize the anti-pref field to FALSE in
      bli_gks_cntx_set_l3_nat_ukr_prefs().
    - Added bli_gemm_ker_var1.c, which implements a panel-block macro-kernel
      in terms of the existing block-panel macro-kernel _ker_var2(). This
      technique requires inducing transposes on all operands and swapping
      A and B.
    - Changed bli_obj_induce_trans() macro so that pack-related fields are
      also changed to reflect the induced transposition.
    - Added a temporary hack to bli_l3_3m4m1m_oapi.c that allows us to easily
      specify the 1m algorithm (block-panel or panel-block).
    - Renamed the following cntx_t-related macros:
        bli_cntx_get_pack_schema_a() -> bli_cntx_get_pack_schema_a_block()
        bli_cntx_get_pack_schema_b() -> bli_cntx_get_pack_schema_b_panel()
        bli_cntx_get_pack_schema_c() -> bli_cntx_get_pack_schema_c_panel()
      and updated all instantiations. Also updated the field names in the
      cntx_t struct.
    - Comment updates.
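
The "anti-preference" item above can be pictured as a toggle on the result of the ordinary preference query. The following is a minimal sketch under that reading. The struct and function names are hypothetical simplifications, not the actual cntx_t layout or the bli_cntx_l3_ukr_eff_*() definitions.

```c
#include <stdbool.h>

/* Hypothetical, reduced stand-in for the relevant context fields. */
typedef struct
{
    bool ukr_prefers_rows; /* output storage preference of the micro-kernel */
    bool anti_pref;        /* when true, invert the answer of the query     */
} sketch_cntx_t;

/* Plain query: does the micro-kernel prefer the storage of C? */
bool sketch_ukr_prefers_storage_of( const sketch_cntx_t* cntx, bool c_is_row_stored )
{
    return cntx->ukr_prefers_rows == c_is_row_stored;
}

/* "Effective" query: same answer, toggled when anti-preference is set. */
bool sketch_ukr_eff_prefers_storage_of( const sketch_cntx_t* cntx, bool c_is_row_stored )
{
    bool r = sketch_ukr_prefers_storage_of( cntx, c_is_row_stored );
    return cntx->anti_pref ? !r : r;
}
```

With anti_pref set, a caller that transposes the operation whenever the effective query reports a mismatch ends up seeking disagreement between C and the micro-kernel preference, which is the behavior the entry describes for the panel-block case.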

commit 41595e98eedaf3f1f93802c14dcae490402f933f
Merge: d625c49e a6ab91bc
Author: praveeng <praveen.g@amd.com>
Date:   Wed Dec 7 15:13:21 2016 +0530

    Merge master code as on 2016_12_07 to amd-staging
    
    Change-Id: I5d9ecef9bff960aeb9b51ca4e4b21714e789e44f

commit d625c49e20bd3c50d6d44e330e34076cced114a3
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Tue Nov 29 15:05:19 2016 +0530

    checked-in SGEMMTRSM microkernel for Zen
    
    Change-Id: Ib61936418dea911b2154aa99f703b66e9669f94f

commit a6ab91bc61432490fadf18d596de4589645f37dd
Merge: 145a551d 7f31a630
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 30 09:26:58 2016 -0600

    Merge pull request #111 from figual/master
    
    Fixed missing cntx argument in ARMv8 microkernels.

commit 7f31a6307b7bd35f913c895947552c3a176f789b
Author: Francisco Igual <figual@ucm.es>
Date:   Sun Nov 27 14:40:47 2016 +0100

    Fixed missing cntx argument in ARMv8 microkernels.

commit 126482a3b609b9ad7026ba348f6c4bf6a29be8a1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Nov 25 18:29:49 2016 -0600

    Implemented the 1m method.
    
    Details:
    - Implemented the 1m method for inducing complex domain matrix
      multiplication. 1m support has been added to all level-3 operations,
      including trsm, and is now the default induced method when native
      complex domain gemm microkernels are omitted from the configuration.
    - Updated _cntx_init() operations to take a datatype parameter. This was
      needed for the corresponding function for 1m (because 1m requires us
      to choose between column-oriented or row-oriented execution, which
      requires us to query the context for the storage preference of the
      gemm microkernel, which requires knowing the datatype) but I decided
      that it made sense for consistency to add the parameter to all other
      cntx initialization functions as well, even though those functions
      don't use the parameter.
    - Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take
      a second scalar for each blocksize entry. The semantic meaning of the
      two scalars now is that the first will scale the default blocksize
      while the second will scale the maximum blocksize. This allows scaling
      the two independently, and was needed to support 1m, which requires
      scaling for a register blocksize but not the register storage
      blocksize (ie: "packdim") analogue.
    - Deprecated bli_blksz_reduce_dt_to() and defined two new functions,
      bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing
      default and maximum blocksizes to some desired blocksize multiple.
      These functions are needed in the updated definitions of
      bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs().
    - Added support for the 1e and 1r packing schemas to packm, including
      1e/1r packing kernels.
    - Added a minor optimization to bli_gemm_ker_var2() that allows, under
      certain circumstances (specifically, real domain beta and row- or
      column-stored matrix C), the real domain macrokernel and microkernel
      to be called directly, rather than using the virtual microkernel
      via the complex domain macrokernel, which carries a slight additional
      amount of overhead.
    - Added 1m support to the testsuite.
    - Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified
      some code in test_gemm.c driver.

commit d8f13beeea90338e0ecb0a3aeaa2d59d8ebd6c36
Merge: c25a9205 145a551d
Author: praveeng <praveen.g@amd.com>
Date:   Fri Nov 25 17:31:08 2016 +0530

    Merge master code till 2016_11_25 to amd-staging

commit c25a9205fd8c8d8de7fd81b1e5621e7ac79f4e87
Merge: 65298762 bdc0a264
Author: praveeng <praveen.g@amd.com>
Date:   Fri Nov 25 17:06:36 2016 +0530

    Merge master code till Switched to simpler trsm_r 2016_11_25 to amd-staging
    
    Change-Id: Ibf71d224d8fb6cf0bc497f84d50c27d276512cc1

commit 145a551d524ae5492667a05fc248923d922df850
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 23 17:59:06 2016 -0600

    Switched to simpler trsm_r implementation.
    
    Details:
    - Disabled the implementation of trsm_r that allows the right-hand matrix
      B to be triangular, and switched to the implementation that simply
      transposes the operation (and thus the storage of C) in order to recast
      the operation as trsm_l. This avoids the need to use trsm_rl and trsm_ru
      macrokernels, which require an awkward swapping of MR and NR. For now,
      the support for trsm_r macrokernels, via separate control trees, remains.
    - Modified bli_config_macro_defs.h so that BLIS_RELAX_MCNR_NCMR_CONSTRAINTS
      is defined by default. This is mostly a safety precaution in case someone
      tries to switch back to the previous trsm_r implementation, but also
      serves as a convenience on some systems where one does not naturally
      choose blocksizes in a way that satisfies MC % NR = 0 and NC % MR = 0.

commit b3e58ee30307cf1e11529f2113acb9abbeda25af
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 23 17:58:26 2016 -0600

    Reimplemented 4x12 haswell ukernels (real only).
    
    Details:
    - Replaced permutation-based implementations in bli_gemm_asm_d4x12.c, which
      defines 4x24 single real and 4x12 double real gemm microkernels, with
      broadcast-based implementations. (The previous microkernel file has been
      moved to an 'old' subdirectory.)

commit 65298762ff15c45e8588e0c279a9feaa98c927a0
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Tue Nov 22 12:15:33 2016 +0530

    removed a redundant copy operation in DNRM2
    
    Change-Id: I673b08efde4480e871779716f7715566740ad9ce

commit d6863e851adeef037e4d1476fe63bb293fb9d987
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Mon Nov 21 11:30:30 2016 +0530

    checked-in DNRM2 optimizations
    
    Change-Id: I3b31d768bd7f4fbf43042aa5a0762995c73c4522

commit bdc0a264d2fb5940bfd09298b1de823674a39053
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 16 14:13:08 2016 -0600

    Adjusted stride selection of ct in macrokernels.
    
    Details:
    - Updated the changes introduced in 618f433 so that the strides of the
      temporary microtile ct used in the macrokernels are determined based
      on the storage preference of the microkernel (via the new functions
      below), rather than the strides of c. In almost all cases, presently,
      this change results in no net effect, as a high-level optimization
      in the _front() functions aligns the storage of c to that of the
      microkernel's preference. However, I encountered some cases where
      this is not always the case in some development code that has yet
      to be committed, and therefore I'm generalizing the framework code
      in advance.
    - Defined two new functions in bli_cntx.c:
        bli_cntx_l3_ukr_prefers_rows_dt()
        bli_cntx_l3_ukr_prefers_cols_dt()
      which return bool_t's based on the current micro-kernel's storage
      preferences. For induced methods, the preference of the underlying
      real domain microkernel is returned.
    - Updated definition of bli_cntx_l3_ukr_dislikes_storage_of(), and
      by proxy bli_cntx_l3_ukr_prefers_storage_of(), to be in terms of
      the above functions, rather than querying the preferences of the
      native microkernel directly (which did the wrong thing for induced
      methods).

commit 031978d2647cf08316858baf29c84ebba9c3133e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 16 14:04:33 2016 -0600

    Fixed inactive trsm_r blocksize constraint code.
    
    Details:
    - Changed a cpp macro that was meant to prevent using certain trsm_r code
      if BLIS_RELAX_MCNR_NCMR_CONSTRAINTS was defined. It was actually coded
      incorrectly at first. I've now fixed its location and changed its
      consequence to a compile-time #error message.

commit 9772218cae57d55c252595b01e3669d8bed84944
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Wed Nov 16 15:19:19 2016 +0530

    Added optimized DAMAX routines for Zen
    
    Change-Id: I499c0c8f0f4ce6c19235c47b86d5608db6ba50f8

commit 9c448e30174e5eb76a94b43b30819704a5dfcb3f
Merge: 998d8240 e35d3c23
Author: Santanu Thangaraj <Santanu.Thangaraj@amd.com>
Date:   Wed Nov 16 04:18:57 2016 -0500

    Merge "Added new optimized micro-kernel for dotxv routine" into amd-staging

commit 998d824044adac0d54c921dcd44fb58f3d54aad2
Merge: 0d13e9a4 6b5a4032
Author: praveeng <praveen.g@amd.com>
Date:   Wed Nov 16 14:22:42 2016 +0530

    Merge master code till devinamatthews/omp_num_thrds 2016_11_16 to amd-staging
    
    Change-Id: I601ff1d3ec8a680e1be039ffc7b299744e8a27c5

commit 6b5a4032d2e3ed29a272c7f738b7e3ed6657e556
Merge: 3b524a08 a8220e3a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Nov 10 15:28:24 2016 -0600

    Merge pull request #109 from devinamatthews/omp_num_threads
    
    Add automatic loop thread assignment.

commit a8220e3a86433b5d76789e32ea7ca014a11b6d17
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu Nov 10 14:19:34 2016 -0600

    - Fix typo in bli_cntx.c
    - Bump BLIS_DEFAULT_NR_THREAD_MAX to 4

commit e35d3c23f28784e50ee13d2e77a69d60e0c24c1f
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Thu Nov 10 14:30:53 2016 +0530

    Added new optimized micro-kernel for dotxv routine
    
    Change-Id: I2c544e9b25a454d971ad690353502a55cd668391

commit 0d13e9a4f6f2fcda08f205215240cdf86442d6c6
Merge: e044fa62 3b524a08
Author: praveeng <praveen.g@amd.com>
Date:   Mon Nov 7 14:40:41 2016 +0530

    bli_kernel.h
    
    Change-Id: I425d089f79497a0de7d1622e829c3ca9edf7f091

commit c05b3862f6241486442b313eff0c8bee7b5e1274
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Nov 4 15:48:02 2016 -0500

    Add automatic loop thread assignment.
    
    - Number of threads is determined by BLIS_NUM_THREADS or OMP_NUM_THREADS, but can be overridden by BLIS_XX_NT as before.
    - Threads are assigned to loops (ic, jc, ir, and jr) automatically by weighted partitioning and heuristics, both of which are tunable via bli_kernel.h.
    - All level-3 BLAS covered.

commit 3b524a08e3fb8380e7b8b2ba835312c51a331570
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 2 17:45:18 2016 -0500

    Consolidated 3m1/4m1 gemmtrsm, trsm ukernel code.
    
    Details:
    - Consolidated the macros that define the lower and upper versions of the
      gemmtrsm microkernels into a single macro that is instantiated twice.
      Did this for both 3m1 and 4m1 microkernels.
    - Consolidated lower and upper versions of the trsm microkernels for 3m1
      and 4m1 into single files (each).

commit ead231aca635deb3db270f118454e4222c627f31
Merge: d25e6f8b 62987f60
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 2 13:03:50 2016 -0500

    Merge pull request #108 from devinamatthews/patch-2
    
    Update .travis.yml with additional tests

commit 62987f60a6a6ff0a75b31d0404f493593ce35ccc
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Nov 2 11:20:37 2016 -0500

    Allow KNL to fail

commit 8f9010542c751ae3cbfe6121cb011d8985c1e00d
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Nov 2 11:18:32 2016 -0500

    Fix some problems with OSX builds:
    
    - Update CPU detection for Intel archs (esp. Skylake)
    - Allow clang for the reference config

commit d25e6f8b63c57f30b8a67dffbf4995977cf9f235
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Nov 1 14:35:15 2016 -0500

    Can disable trsm_r-specific blocksize constraints.
    
    Details:
    - Added cpp guards around the constraints in bli_kernel_macro_defs.h
      that enforce MC % NR = 0 and NC % MR = 0. These constraints are ONLY
      needed when handling right-side trsm by allowing the matrix on the
      right (matrix B) to be triangular, because it involves swapping
      register, but not cache, blocksizes (packing A by NR and B by MR)
      and then swapping the operands to gemmtrsm just before that kernel
      is called. It may be useful to disable these constraints if, for
      example, the developer wishes to test the configuration with
      a different set of cache blocksizes where only MC % MR = 0 and
      NC % NR = 0 are enforced.
    - In summary, #defining BLIS_RELAX_MCNR_NCMR_CONSTRAINTS will bypass
      the enforcement of MC % NR = 0 and NC % MR = 0.
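
The constraint logic in the entry above can be summarized as a small predicate. In BLIS these checks are cpp guards evaluated at compile time; the runtime function below is only an illustration, and its name and parameters are hypothetical.

```c
#include <stdbool.h>

/* Returns true if the blocksizes satisfy the constraints described above.
   relax_mcnr_ncmr plays the role of #defining
   BLIS_RELAX_MCNR_NCMR_CONSTRAINTS. */
bool sketch_blksz_constraints_ok( int mc, int nc, int mr, int nr,
                                  bool relax_mcnr_ncmr )
{
    /* Always enforced: MC % MR = 0 and NC % NR = 0. */
    if ( mc % mr != 0 || nc % nr != 0 ) return false;

    /* Only needed for the trsm_r variant that swaps register blocksizes:
       MC % NR = 0 and NC % MR = 0. */
    if ( !relax_mcnr_ncmr )
    {
        if ( mc % nr != 0 || nc % mr != 0 ) return false;
    }

    return true;
}
```

For example, with MR = 6 and NR = 8, a choice of MC = 102 satisfies MC % MR = 0 but not MC % NR = 0, so it is accepted only when the relaxation is enabled.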

commit 1a67e3688edb073a9d44c160e7b0798e08796b8a
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Nov 1 13:53:18 2016 -0500

    Bogus commit
    
    Need to trigger another Travis build.

commit 2cd82d67b372cad1bed50cfd99e524f1f40b4e24
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Nov 1 13:25:50 2016 -0500

    Some fixes for .travis.yml
    
    - Switch to gcc-5 to support knl
    - Don't run tests in parallel -- it is super slow.
    - Use clang on OSX since gcc is only a zombie husk.

commit a3db4e6bdfe745083acf704ab0f51f74ea869538
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Nov 1 10:33:18 2016 -0500

    Update .travis.yml with additional tests
    
    - Test knl configuration (without running of course).
    - Test openmp and pthreads threading for auto configuration with 4 threads.
    - Test auto configuration with and without pthreads on OSX.
    - Also, run make in parallel.
    
    I don't know how the `addons:` section works on OSX; hopefully it is just ignored.

commit 8a11a2174a1a5b9426f13bbc5338dc86ab138cdd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 31 19:07:55 2016 -0500

    Updates to non-default haswell microkernels.
    
    Details:
    - Updated s and d microkernels in bli_gemm_asm_d8x6.c to relax alignment
      constraints.
    - Added missing c and z microkernels, which are based on the corresponding
      kernels in the d6x8 set.
    - This completes the d8x6 set (which may be used for situations when it
      is desirable to have a microkernel with a column preference).

commit 618f4331eba209803ecab99747872eceb1b5f091
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 31 14:40:51 2016 -0500

    Align strides of ct in macrokernels to that of c.
    
    Details:
    - Previously, rs_ct and cs_ct, the strides of the temporary microtile used
      primarily in the macrokernels' edge case handling, were unconditionally
      set to 1 and MR, respectively. However, Devin Matthews noted that this
      ought to be changed so that the strides of ct were in agreement with the
      strides of C. (That is, if C was row-stored, then ct should be accessed
      by rows as well.) The implicit assumption is that the strides of C
      have already been adjusted, via induced transposition, if the storage
      preference of the microkernel is at odds with the storage of C. So, if
      the microkernel prefers row storage, the macrokernel's interior cases
      would present row-stored (ideal) microkernel subproblems to the
      microkernel, but for edge cases, it would still see column-stored
      subproblems (not ideal). This commit fixes this issue. Thanks to Devin
      for his suggestion.
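
A minimal sketch of the stride selection described above. The column-stored case (1 and MR) comes from the entry itself. The row-stored values (NR and 1) are an assumption about what "in agreement with the strides of C" means for an MR x NR microtile. The names are hypothetical.

```c
/* Row and column strides of the temporary MR x NR microtile ct. */
typedef struct { int rs; int cs; } sketch_strides_t;

sketch_strides_t sketch_ct_strides( int c_is_row_stored, int mr, int nr )
{
    sketch_strides_t s;

    if ( c_is_row_stored )
    {
        /* Assumed row-stored layout: consecutive elements of a row are
           contiguous, and rows are NR elements apart. */
        s.rs = nr;
        s.cs = 1;
    }
    else
    {
        /* Previous unconditional behavior: column-stored, with columns
           MR elements apart. */
        s.rs = 1;
        s.cs = mr;
    }

    return s;
}
```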

commit c2c91e09b4893cb81314774557f728a95080f81e
Author: Jeff Hammond <jeff.science@gmail.com>
Date:   Tue Oct 25 21:15:26 2016 -0700

    never use libm with Intel compilers
    
    Intel compilers include a highly optimized math library (libimf) that
    should be used instead of GNU libm.
    
    yes, this change is for ALL targets, including those that are not
    supported by the Intel compiler.  there is no harm in doing this, and it
    is future-proof in the event that the Intel compilers support other
    architectures.

commit 630391002325a589063aec2ab0a7d89ef2e178c0
Merge: 956b3edf 216206c1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Oct 25 19:34:51 2016 -0500

    Merge pull request #105 from devinamatthews/knl
    
    Support for Intel Knight's Landing.

commit 216206c1d328a865c2192e35a4df6e9aff79a85b
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Oct 25 13:56:18 2016 -0500

    Fix up for merge to master.

commit 11eb7957abbcdf02d5e312898e094260eadb1209
Merge: cd5b6681 956b3edf
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Oct 25 13:51:07 2016 -0500

    Merge branch 'master' into knl
    
    # Conflicts:
    #       frame/thread/bli_thread.h

commit cd5b6681838899283cd94e5427dfda206e7fbabe
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Oct 25 13:49:27 2016 -0500

    Don't use %rbp in KNL packing kernels.

commit 956b3edf8eb09480f31f2e861c1b10f9ecbb2e52
Merge: b7e41d71 0662a3c1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Oct 25 13:02:57 2016 -0500

    Merge pull request #104 from devinamatthews/misspellings
    
    Add flexible options for thread model (pthread/posix for pthreads etc.).

commit 0662a3c1b1f4644a86bf8e5073d1391808c91b4a
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Oct 25 12:42:44 2016 -0500

    Add flexible options for thread model (pthread/posix for pthreads etc.).

commit e044fa624008c161de32a39d734cddf1dd22dd41
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Tue Oct 25 13:03:05 2016 +0530

    Changed double precision trsm kernel macro definition to bli_dtrsm_l_int_6x8 from 6x16 : it fixes the seg fault
    
    Change-Id: Ia8c1de5fe13a370d691570a50136d55ffb18908a

commit b3ed4933aa0da72ad771fb0fdf1727e5ba9ad7b4
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Tue Oct 25 13:03:05 2016 +0530

    Changed double precision trsm kernel macro definition to bli_dtrsm_l_int_6x8 from 6x16 : it fixes the seg fault
    
    Change-Id: Ia8c1de5fe13a370d691570a50136d55ffb18908a

commit b7e41d71b07d2af6d22d632c70e0c5f7ce46852c
Merge: 4bd905bd 5117d444
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 24 16:47:46 2016 -0500

    Merge pull request #103 from devinamatthews/patch-1
    
    Change .align to .p2align in Bulldozer ukernels.

commit 5117d444f7f3a2bc327f067926eaf2398212edda
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Oct 24 16:20:47 2016 -0500

    Change .align to .p2align in Bulldozer ukernels
    
    Apparently OSX doesn't allow .align directives for >16B, so I've changed these to their .p2align counterparts.

commit 4bd905bd4597e0ad7bedf31e25e779d3e2dfda29
Merge: 936d5fdc 7f32dd57
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 21 14:48:44 2016 -0500

    Merge pull request #93 from ShadenSmith/config_check
    
    Adds sanity check to configuration choice.

commit 936d5fdc26c6c4dab199a8d11fde948975cfa1d6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 21 14:34:27 2016 -0500

    Fixed multithreading compilation bug in 970745a.
    
    Details:
    - Moved the definition of the cpp macro BLIS_ENABLE_MULTITHREADING
      from bli_thread.h to bli_config_macro_defs.h. Also moved the
      sanity check that OpenMP and POSIX threads are not both enabled.
    - Thanks to Krzysztof Drewniak for reporting this bug.

commit d250e6a3af3af8beedcda28f508ac03e94efb3c8
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Thu Oct 20 14:34:39 2016 +0530

    Merged TRSM and scalv routines into zen folder
    
    Change-Id: Ice897bc83e8fb70b90f23cc3ce892c39883aceb9

commit 8feb0f85a674e84bec2417486e3bcea584b14c04
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 19 16:05:41 2016 -0500

    Removed auto-prototyping of malloc()/free() substitutes.
    
    Details:
    - Removed the header file, bli_malloc_prototypes.h, which automatically
      generated prototypes for the functions specified by the following
      cpp macros:
        BLIS_MALLOC_INTL
        BLIS_FREE_INTL
        BLIS_MALLOC_POOL
        BLIS_FREE_POOL
        BLIS_MALLOC_USER
        BLIS_FREE_USER
      These prototypes were originally provided primarily as a convenience
      to those developers who specified their own malloc()/free() substitutes
      for one or more of the macros above. However, we generated these prototypes
      regardless, even when the default values (malloc and free) of the
      macros above were used. A problem arose under certain circumstances
      (e.g., gcc in C++ mode on Linux with glibc) when including blis.h that
      stemmed from the "throw" specification which was added to glibc's
      malloc() prototype, resulting in a prototype mismatch. Therefore, going
      forward, developers who specify their own custom malloc()/free()
      substitutes must also prototype those substitutes via bli_kernel.h.
      Thanks to Krzysztof Drewniak for reporting this bug, and Devin Matthews
      for researching the nature and potential solutions.

commit 970745a5fc7c29de3e202988e5eb104fabca4fdc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 19 15:58:03 2016 -0500

    Reorganized typedefs to avoid compiler warnings.
    
    Details:
    - Relocated membrk_t definition from bli_membrk.h to bli_type_defs.h.
    - Moved #include of bli_malloc.h from blis.h to bli_type_defs.h.
    - Removed standalone mtx_t and mutex_t typedefs in bli_type_defs.h.
    - Moved #include of bli_mutex.h from bli_thread.h to bli_type_defs.h.
    - The redundant typedefs of membrk_t and mtx_t caused a warning on some C
      compilers. Thanks to Tyler Smith for reporting this issue.

commit 1c2f7b57d557c05f5ef6148cccafaf0f70d910da
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Tue Oct 18 15:06:35 2016 +0530

    Removed symlinks to zen kernels from haswell kernel folder and also modified the bli_kernel.h file accordingly
    
    Change-Id: Ib3736af48e851c8243bbe10d937fb942c49ad048

commit d864ea9f4f039fe2b2dc395d0015bd9e8902bc8e
Merge: 7045fcbf 28b2af8a
Author: praveeng <praveen.g@amd.com>
Date:   Fri Oct 14 17:00:57 2016 +0530

    Merge master code 2016_10_14 till Added disabled code thrinfo_t structures
    
    Change-Id: If7db98d286c1471fcd30f00757abee9b253ef987

commit 28b2af8a71133ce68774e153b6e05afb05affba8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 13 14:50:08 2016 -0500

    Added disabled code to print thrinfo_t structures.
    
    Details:
    - Added cpp-guarded code to bli_thrcomm_openmp.c that allows a curious
      developer to print the contents of the thrinfo_t structures of each
      thread, for verification purposes or just to study the way thread
      information and communicators are used in BLIS.
    - Enabled some previously-disabled code in bli_l3_thrinfo.c for freeing
      an array of thrinfo_t* values that is used in the new, cpp-guarded code
      mentioned above.
    - Removed some old commented lines from bli_gemm_front.c.

commit 11eed3f683d09e65f721567b346b0f733bff9a64
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 13 14:23:23 2016 -0500

    Fixed a configure -t omp/openmp bug from fd04869.
    
    Details:
    - Forgot to update certain occurrences of "omp" in common.mk during
      commit fd04869, which changed the preferred configure option string
      for enabling OpenMP from "omp" to "openmp".

commit 7045fcbf0bd349ebe6cb9ac4508c6a387bb05966
Merge: 7e044900 9cda6057
Author: praveeng <praveen.g@amd.com>
Date:   Thu Oct 13 12:02:28 2016 +0530

    Merge master code 2016_10_13 Removed previously renamed/old files
    
    Change-Id: I8106d371afaa0af474a8967388d44481b05de923

commit 7e04490002206d3557fcfb7dd893838a7f36916f
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Wed Oct 12 16:43:02 2016 +0530

    Checked in the SAMAX optimizations
    
    Change-Id: I7faf8c3adf52ff01432188ad3b9866ee4b9a9dfd

commit 9cda6057eaa16a24ac8785a9fa167df6c9edba44
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Oct 11 13:21:26 2016 -0500

    Removed previously renamed/old files.
    
    Details:
    - Removed frame/base/bli_mem.c and frame/include/bli_auxinfo_macro_defs.h,
      both of which were renamed/removed in 701b9aa. For some reason, these
      files survived when the compose branch was merged back into master.
      (Clearly, git's merging algorithm is not perfect.)
    - Removed frame/base/bli_mem.c.prev (an artifact of the long-ago changed
      memory allocator that I was keeping around for no particular reason).

commit 22377abd84b9e560ffe1c4e4d284eb443ddb7133
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 10 13:43:56 2016 -0500

    Fixed bli_gemm() segfault on empty C matrices.
    
    Details:
    - Fixed a bug that would manifest in the form of a segmentation fault
      in bli_cntl_free() when calling any level-3 operation on an empty
      output matrix (ie: m = n = 0). Specifically, the code previously
      assumed that the entire control tree was built prior to it being
      freed. However, if the level-3 operation performs an early exit, the
      control tree will be incomplete, and this scenario is now handled.
      Thanks to Elmar Peise for reporting this bug.

commit 0b571cd94d9b175331c9453258a6b1389a718ae8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 6 14:48:15 2016 -0500

    Fixed segfault in bli_free_align() for NULL ptrs.
    
    Details:
    - Fixed a bug in bli_free_align() caused by failing to handle NULL pointers
      up-front, which led to performing pointer arithmetic on NULL pointers in
      order to free the address immediately before the pointer. Thanks to Devin
      Matthews for reporting this bug.
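
The bug class described above can be illustrated with a generic aligned allocator that stores the address returned by malloc() immediately before the aligned block. This is a sketch of the general technique with an up-front NULL check, not the BLIS source. The function names are hypothetical, and the alignment is assumed to be a nonzero multiple of sizeof(void*).

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate size bytes aligned to align bytes (assumed to be a nonzero
   multiple of sizeof(void*)). */
void* sketch_malloc_align( size_t size, size_t align )
{
    /* Room for the payload, worst-case padding, and one saved pointer. */
    void* raw = malloc( size + align + sizeof( void* ) );
    if ( raw == NULL ) return NULL;

    /* Skip the slot for the saved pointer, then round up to align. */
    uintptr_t p   = ( uintptr_t )raw + sizeof( void* );
    uintptr_t rem = p % align;
    if ( rem != 0 ) p += align - rem;

    /* Save the original address immediately before the aligned block. */
    ( ( void** )p )[ -1 ] = raw;

    return ( void* )p;
}

void sketch_free_align( void* p )
{
    /* Handle NULL up-front, before any pointer arithmetic. */
    if ( p == NULL ) return;

    /* Recover and free the address that malloc() originally returned. */
    free( ( ( void** )p )[ -1 ] );
}
```

Without the early return, passing NULL would read from the address just before NULL, which is the failure mode the entry describes.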

commit cd84fb95182514601d72c78ee0e36a394d0284d7
Author: praveeng <praveen.g@amd.com>
Date:   Thu Oct 6 15:08:21 2016 +0530

    syntax errors in configure file
    
    Change-Id: Ibe8a6071aad97df550df64c009fec33a9d8f43a1

commit f2e7ea113aa93b74f1d42408d5db2c5a7b00a653
Merge: 133983c3 86969873
Author: praveeng <praveen.g@amd.com>
Date:   Thu Oct 6 12:35:30 2016 +0530

    conflicts merge for bli_kernel.h
    
    Change-Id: I15d846bd34e11f86ebfd7ed091ff671a1f3366a0

commit 133983c36fa01c7acb6d666b3744f77f216314a5
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Thu Oct 6 11:26:22 2016 +0530

    code clean up in bli_kernel.h
    
    Change-Id: I11d9cdf2af8e8199209eb084f6c3a7c910b83d5d

commit 4fb9b4ef2e4cf2626a6e000a41628fb823f16da8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 5 14:41:35 2016 -0500

    CHANGELOG update (0.2.1)

commit 866b2dde3f41760121115fb25f096d4344e8b4f9 (tag: 0.2.1)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 5 14:41:34 2016 -0500

    Version file update (0.2.1)

commit 87fddeab3c8a5ccb1bbf02e5f89db1464e459ba9
Merge: 86969873 6f71cd34
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 5 13:35:01 2016 -0500

    Merge branch 'compose'

commit 6f71cd344951854e4cff9ea21bbdfe536e72611d (origin/compose)
Merge: c0630c40 8d55033c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Oct 4 15:53:46 2016 -0500

    Merge pull request #94 from flame/distcomm
    
    Implemented distributed thrinfo_t management.

commit 86969873b5b861966d717d8f9f370af39e3d9de6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Oct 4 14:24:59 2016 -0500

    Reclassified amaxv operation as a level-1v kernel.
    
    Details:
    - Moved amaxv from being a utility operation to being a level-1v operation.
      This includes the establishment of a new amaxv kernel to live beside all
      of the other level-1v kernels.
    - Added two new functions to bli_part.c:
        bli_acquire_mij()
        bli_acquire_vi()
      The first acquires a scalar object for the (i,j) element of a matrix,
      and the second acquires a scalar object for the ith element of a vector.
    - Added integer support to bli_getsc level-0 operation. This involved
      adding integer support to the bli_*gets level-0 scalar macros.
    - Added a new test module to test amaxv as a level-1v operation. The test
      module works by comparing the value identified by bli_amaxv() to the
      value found from a reference-like code local to the test module
      source file. In other words, it (intentionally) does not guarantee the
      same index is found; only the same value. This allows for different
      implementations in the case where a vector contains two or more elements
      containing exactly the same floating point value (or values, in the case
      of the complex domain).
    - Removed the directory frame/include/old/.
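    
    The value-based check used by the amaxv test module can be sketched
    as below. The names ref_amaxv() and amaxv_value_matches() are
    hypothetical, the sketch covers only the real double-precision case,
    and comparing absolute values is an assumption about what "the same
    value" means here.
    
    ```c
    #include <stddef.h>
    
    /* Absolute value without relying on the math library. */
    static double abs_d( double v ) { return v < 0.0 ? -v : v; }
    
    /* Reference-style search: index of the first element with the
       largest absolute value. Assumes n >= 1. */
    size_t ref_amaxv( size_t n, const double* x )
    {
        size_t i_max = 0;
    
        for ( size_t i = 1; i < n; ++i )
            if ( abs_d( x[ i ] ) > abs_d( x[ i_max ] ) ) i_max = i;
    
        return i_max;
    }
    
    /* Returns 1 if the index under test refers to an element whose
       absolute value equals that of the reference result, even when the
       two indices differ; returns 0 otherwise. */
    int amaxv_value_matches( size_t n, const double* x, size_t i_test )
    {
        if ( i_test >= n ) return 0;
    
        return abs_d( x[ i_test ] ) == abs_d( x[ ref_amaxv( n, x ) ] );
    }
    ```
    
    For x = { 1.0, -3.0, 2.0, 3.0 }, indices 1 and 3 are both accepted,
    since each refers to an element of magnitude 3.0.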

commit 8d55033c966feed99fcca2a58017c3ab5b1646dc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 27 15:20:58 2016 -0500

    Implemented distributed thrinfo_t management.
    
    Details:
    - Implemented Ricardo Magana's distributed thread info/communicator
      management. Rather than fully construct the thrinfo_t structures, from
      root to leaf, prior to spawning threads, the threads individually
      construct their thrinfo_t trees (or, chains), and do so incrementally,
      as needed, reusing the same structure nodes during subsequent blocked
      variant iterations. This required moving the initial creation of the
      thrinfo_t structure (now, the root nodes) from the _front() functions
      to the bli_l3_thread_decorator(). The incremental "growing" of the tree
      is performed in the internal back-end (ie: _int()) function, and so
      mostly invisible. Also, the incremental growth of the thrinfo_t tree is
      done as a function of the current and parent control tree nodes (as well
      as the parent thrinfo_t node), further reinforcing the parallel
      relationship between the two data structures.
    - Removed the "inner" communicator from thrinfo_t structure definition,
      as well as its id. Changed all APIs accordingly. Renamed
      bli_thrinfo_needs_free_comms() to bli_thrinfo_needs_free_comm().
    - Defined bli_l3_thrinfo_print_paths(), which prints the information
      in an array of thrinfo_t* structure pointers. (Used only as a
      debugging/verification tool.)
    - Deprecated the following thrinfo_t creation functions:
        bli_packm_thrinfo_create()
        bli_l3_thrinfo_create()
      because they are no longer used. bli_thrinfo_create() is now called
      directly when creating thrinfo_t nodes.
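    
    The incremental growth described in the first item can be pictured
    with a small sketch. It uses a hypothetical node type (tnode_t) and
    function (tnode_grow), and it omits the dependence on the control
    tree nodes that the real code has; only the "create on first use,
    reuse afterwards" behavior is shown.
    
    ```c
    #include <stdlib.h>
    
    /* Hypothetical per-thread info node (not the BLIS thrinfo_t layout). */
    typedef struct tnode_s
    {
        int             depth;
        struct tnode_s* sub_node; /* NULL until first needed */
    } tnode_t;
    
    /* Returns the child of 'parent', creating it on first use and
       returning the same node on later calls. Returns NULL only if the
       allocation fails. */
    tnode_t* tnode_grow( tnode_t* parent )
    {
        if ( parent->sub_node == NULL )
        {
            tnode_t* child = malloc( sizeof( *child ) );
            if ( child == NULL ) return NULL;
    
            child->depth     = parent->depth + 1;
            child->sub_node  = NULL;
            parent->sub_node = child;
        }
    
        return parent->sub_node;
    }
    ```
    
    Each thread would own its root node and call tnode_grow() as it
    descends, so later loop iterations find the chain already in place.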

commit fd04869ae4d4a3b0ebb9052557c296456bce7c0d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 27 14:14:11 2016 -0500

    Changed configure's 'omp' threading to 'openmp'.
    
    Details:
    - Changed the configure script so that the expected string argument to the
      -t (or --enable-threading=) option that enables OpenMP multithreading is
      'openmp'. The previous expected string, 'omp', is still supported but
      should be considered deprecated.

commit 9424af87209e4e435e2e742430945152690170b0
Merge: efa7341d c0630c40
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 27 12:51:08 2016 -0500

    Merge branch 'compose'

commit 7f32dd57c6bd41c0704341752842277dd6a4c8eb
Author: Shaden Smith <shaden@cs.umn.edu>
Date:   Sat Sep 17 11:33:57 2016 -0500

    Adds sanity check to configuration choice.

commit efa7341df0b0115926aa8a6e8a4ebfb24fdbf11e
Merge: 121c39d4 e1453f68
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Sep 16 11:01:57 2016 -0500

    Merge pull request #92 from ShadenSmith/readme_fix
    
    Fixes broken URL in README.md

commit e1453f68f6afd90ae9a29b7a5faa46aa79bbf741
Author: Shaden Smith <ShadenTSmith@gmail.com>
Date:   Fri Sep 16 09:29:28 2016 -0500

    Fixes broken URL in README.md

commit b922d7563422e14c49a4677bc6ae088a408861ed
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 23 13:38:36 2016 -0500

    Avoid compiling BLAS/CBLAS files when disabled.
    
    Details:
    - Updated the top-level Makefile, build/config.mk.in template, and
      configure script so that object files corresponding to source files
      belonging to the BLAS compatibility layer are not compiled (or archived)
      when the compatibility layer is disabled. (Same for CBLAS.) Thanks
      to Devin Matthews for suggesting this optimization.
    - Slight change to the way configure handles internal variables. Instead
      of converting (overwriting) some, such as enable_blas2blis and
      enable_cblas, from a "yes" or "no" to a "1" or "0" value, the latter are
      now stored in new variables that live alongside the originals (with the
      suffix "_01").  This is convenient since some values need to be
      sed-substituted into the config.mk.in template, which requires "yes" or
      "no", while some need to be written to the bli_config.h.in template,
      which requires "0" or "1".
    
    Updated BLIS4 TOMS citation in README.md.
    
    Added complex gemm micro-kernels for haswell.
    
    Details:
    - Defined cgemm (3x8) and zgemm (3x4) micro-kernels for haswell-based
      architectures. As with their real domain brethren, these kernels prefer
      row storage (though this doesn't affect most users due to high-level
      optimizations in most level-3 operations that induce a transpose to
      whatever storage preference the kernel may have).
    
    Change-Id: I512ab90784ecbb7cdaee24928d2ccebb544ba5c1

commit 69826110bab2a064ec76457c24843d28f2581281
Merge: 64598ee4 a58dd35e
Author: Pradeep Rao <Pradeep.Rao@amd.com>
Date:   Wed Sep 14 03:26:25 2016 -0400

    Merge "Implemented trsm single precision for lower triangular matrices; files added: bli_trsm_l_int_6x16.c; files modified: bli_kernel.h to enable optimized trsm microkernel, and test_trsm.c to test trsm single precision" into amd-staging

commit c0630c4024b08750043a2942a3e8a037aa6b6259
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 12 13:59:02 2016 -0500

    Added debugging printf()'s to bli_l3_thrinfo.c.
    
    Details:
    - Added optional printf() statements to print out thread communicator
      info as the thrinfo_t structure is built in bli_l3_thrinfo.c.
    - Minor changes to frame/thread/bli_thrinfo.h.

commit 7b3bf1ffcd7160ccbf6c2518af6d88f6742e4977
Merge: 35509818 121c39d4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 6 15:47:13 2016 -0500

    Merge branch 'master' into compose

commit 121c39d455f2db6f7ce6802ba7f73ad5e088c68c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 5 13:11:42 2016 -0500

    Added complex gemm micro-kernels for haswell.
    
    Details:
    - Defined cgemm (3x8) and zgemm (3x4) micro-kernels for haswell-based
      architectures. As with their real domain brethren, these kernels prefer
      row storage (though this doesn't affect most users due to high-level
      optimizations in most level-3 operations that induce a transpose to
      whatever storage preference the kernel may have).

commit 35509818cbea1598b123421f81c42120889a03c3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 31 17:34:15 2016 -0500

    Added, moved some thread barriers.
    
    Details:
    - Removed thread barriers from the end of the loop bodies of
      bli_gemm_blk_var1(), bli_gemm_blk_var2(), bli_trsm_blk_var1(),
      and bli_trsm_blk_var2().
    - Moved the thread barrier at the end of bli_packm_int() to the
      end of bli_l3_packm(), and added missing barriers to that function.
    - Removed the no longer necessary (and now incorrect) ochief guard
      in bli_gemm3m3_packa() on the bli_obj_scalar_reset() on C.
    - Thanks to Tyler Smith for help with these changes.

commit 64598ee4cfb86f64abbd4bcef5a82ba0d5565b67
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Wed Aug 31 12:54:50 2016 +0530

    fixed the symlink issue
    
    Change-Id: I2186d529f295c576597c189e1ae219bc1a83f955

commit abd61f9fa75d77a96d1491b3e035451ee73238fe
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 30 12:34:19 2016 -0500

    Updated BLIS4 TOMS citation in README.md.

commit 8a2373f26ba8fcd5b2d7b2cc72cb8b2e1f841a03
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Mon Aug 29 14:10:45 2016 +0530

    Norm 2 optimization
    
    Change-Id: Ide9decaccd20bf0ccc32c9abb6556e038dceed2b

commit fdc663902347aa252ea88cf09ce24ab748958dff
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Mon Aug 29 10:43:38 2016 +0530

    Placed 1 and 1f AMD optimized AVX routines under zen folder
    
    Change-Id: I26795211ef11d232ed794ce36dd0a9c1f8706328

commit 701b9aa3ff028decbf90efac0dca5bd64fe26269
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Aug 26 19:04:45 2016 -0500

    Redesigned control tree infrastructure.
    
    Details:
    - Altered control tree node struct definitions so that all nodes have the
      same struct definition, whose primary fields consist of a blocksize id,
      a variant function pointer, a pointer to an optional parameter struct,
      and a pointer to a (single) sub-node. This unified control tree type is
      now named cntl_t.
    - Changed the way control tree nodes are connected, and what computation
      they represent, such that, for example, packing operations are now
      associated with nodes that are "inline" in the tree, rather than
      off-shoot branches. The original tree for the classic Goto gemm
      algorithm was expressed (roughly) as:
    
        blk_var2 -> blk_var3 -> blk_var1 -> ker_var2
                             |           |
                             -> packb    -> packa
    
      and now, the same tree would look like:
    
        blk_var2 -> blk_var3 -> packb -> blk_var1 -> packa -> ker_var2
    
      Specifically, the packb and packa nodes perform their respective packing
      operations and then recurse (without any loop) to a subproblem. This means
      there are now two kinds of level-3 control tree nodes: partitioning and
      non-partitioning. The blocked variants are members of the former, because
      they iteratively partition off submatrices and perform suboperations on
      those partitions, while the packing variants belong to the latter group.
      (This change has the effect of allowing greatly simplified initialization
      of the nodes, which previously involved setting many unused node fields to
      NULL.)
    - Changed the way thrinfo_t tree nodes are arranged to mirror the new
      connective structure of control trees. That is, packm nodes are no longer
      off-shoot branches of the main algorithmic nodes, but rather connected
      "inline".
    - Simplified control tree creation functions. Partitioning nodes are created
      concisely with just a few fields needing initialization. By contrast, the
      packing nodes require additional parameters, which are stored in a
      packm-specific struct that is tracked via the optional parameters pointer
      within the control tree struct. (This parameter struct must always begin
      with a uint64_t that contains the byte size of the struct. This allows
      us to use a generic function to recursively copy control trees.) gemm,
      herk, and trmm control tree creation continues to be consolidated into
      a single function, with the operation family being used to select
      among the parameter-agnostic macro-kernel wrappers. A single routine,
      bli_cntl_free(), is provided to free control trees recursively, whereby
      the chief thread within a group releases the blocks associated with
      mem_t entries back to the memory broker from which they were acquired.
    - Updated internal back-ends, e.g. bli_gemm_int(), to query and call the
      function pointer stored in the current control tree node (rather than
      index into a local function pointer array). Before being invoked, these
      function pointers are first cast to a gemm_voft (for gemm, herk, or trmm
      families) or trsm_voft (for trsm family) type, which is defined in
      frame/3/bli_l3_var_oft.h.
    - Retired herk and trmm internal back-ends, since all execution now flows
      through gemm or trsm blocked variants.
    - Merged forwards- and backwards-moving variants by querying the direction
      from routines as a function of the variant's matrix operands. gemm and
      herk always move forward, while trmm and trsm move in a direction that
      is dependent on which operand (a or b) is triangular.
    - Added functions bli_thread_get_range_mdim(), bli_thread_get_range_ndim(),
      each of which takes additional arguments and hides complexity in managing
      the difference between the way ranges are computed for the four families
      of operations.
    - Simplified level-3 blocked variants according to the above changes, so that
      the only steps taken are:
      1. Query partitioning direction (forwards or backwards).
      2. Prune unreferenced regions, if they exist.
      3. Determine the thread partitioning sub-ranges.
      <begin loop>
        4. Determine the partitioning blocksize (passing in the partitioning
           direction)
        5. Acquire the current iteration's partitions for the matrices affected
           by the current variant's partitioning dimension (m, k, n).
        6. Call the subproblem.
      <end loop>
    - Instantiate control trees once per thread, per operation invocation.
      (This is a change from the previous regime in which control trees were
      treated as stateless objects, initialized with the library, and shared
      as read-only objects between threads.) This once-per-thread allocation
      is done primarily to allow threads to use the control tree as a place
      to cache certain data for use in subsequent loop iterations. Presently,
      the only application of this caching is a mem_t entry for the packing
      blocks checked out from the memory broker (allocator). If a non-NULL
      control tree is passed in by the (expert) user, then the tree is copied
      by each thread. This is done in bli_l3_thread_decorator(), in
      bli_thrcomm_*.c.
    - Added a new field to the context, an opid_t that tracks the "family"
      of the operation being executed. For example, gemm, hemm, and symm are
      all part of the gemm family, while herk, syrk, her2k, and syr2k are
      all part of the herk family. Knowing the operation's family is necessary
      when conditionally executing the internal (beta) scalar reset on
      C in blocked variant 3, which is needed for gemm and herk families,
      but must not be performed for the trmm family (because beta has only
      been applied to the current row-panel of C after the first rank-kc
      iteration).
    - Reexpressed 3m3 induced method blocked variant in frame/3/gemm/ind
      to conform with the new control tree design, and renamed the
      macro-kernel codes corresponding to 3m2 and 4m1b.
    - Renamed bli_mem.c (and its APIs) to bli_memsys.c, and renamed/relocated
      bli_mem_macro_defs.h from frame/include to frame/base/bli_mem.h.
    - Renamed/relocated bli_auxinfo_macro_defs.h from frame/include to
      frame/base/bli_auxinfo.h.
    - Fixed a minor bug whereby the storage-to-ukr-preference matching
      optimization in the various level-3 front-ends was not being applied
      properly when the context indicated that execution would be via an
      induced method. (Before, we always checked the native micro-kernel
      corresponding to the datatype being executed, whereas now we check
      the native micro-kernel corresponding to the datatype's real projection,
      since that is the micro-kernel that is actually used by induced methods.)
    - Added an option to the testsuite to skip the testing of native level-3
      complex implementations. Previously, it was always tested, provided that
      the c/z datatypes were enabled. However, some configurations use
      reference micro-kernels for complex datatypes, and testing these
      implementations can slow down the testsuite considerably.
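    
    The size-prefixed parameter struct mentioned above is what allows a
    single routine to copy any control tree. A minimal sketch, assuming
    hypothetical names (cnode_t, pack_params_t, cnode_copy) and field
    layouts that only approximate the description in this message:
    
    ```c
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    
    /* Hypothetical unified node: blocksize id, variant function pointer,
       optional parameter struct, and a single sub-node. */
    typedef struct cnode_s
    {
        int             bszid;
        void            ( *var_func )( void );
        void*           params;   /* if non-NULL, begins with a uint64_t size */
        struct cnode_s* sub_node;
    } cnode_t;
    
    /* Example parameter struct. The leading size field lets generic code
       copy it without knowing its concrete type. */
    typedef struct
    {
        uint64_t size;
        int      pack_schema;
    } pack_params_t;
    
    /* Recursively copies a chain of nodes, duplicating each parameter
       struct by reading its byte size from the first field. Allocation
       failures in deeper nodes are not unwound in this sketch. */
    cnode_t* cnode_copy( const cnode_t* src )
    {
        if ( src == NULL ) return NULL;
    
        cnode_t* dst = malloc( sizeof( *dst ) );
        if ( dst == NULL ) return NULL;
    
        *dst          = *src;
        dst->params   = NULL;
        dst->sub_node = NULL;
    
        if ( src->params != NULL )
        {
            uint64_t size;
            memcpy( &size, src->params, sizeof( size ) );
    
            dst->params = malloc( ( size_t )size );
            if ( dst->params == NULL ) { free( dst ); return NULL; }
    
            memcpy( dst->params, src->params, ( size_t )size );
        }
    
        dst->sub_node = cnode_copy( src->sub_node );
    
        return dst;
    }
    ```
    
    Under this scheme a partitioning node leaves params NULL, while a
    packing node points it at a struct such as pack_params_t whose size
    field is set to sizeof( pack_params_t ).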

commit a58dd35ed7b5b77a6b272655d2edd7a822b8fa87
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Fri Aug 26 14:55:12 2016 +0530

    Implemented trsm single precision for lower triangular matrices; files added: bli_trsm_l_int_6x16.c; files modified: bli_kernel.h to enable optimized trsm microkernel, and test_trsm.c to test trsm single precision
    
    Change-Id: Ibddf989f4aad577e89558673e1038cf6ece654d9

commit 73517f522b69de429dd7f3df60a70c068149ab28
Merge: c6f5c215 50293da3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 23 13:46:59 2016 -0500

    Merge branch 'master' into compose

commit 50293da38d5f2b7be9bbc94b9e85aacb6a10f672
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 23 13:38:36 2016 -0500

    Avoid compiling BLAS/CBLAS files when disabled.
    
    Details:
    - Updated the top-level Makefile, build/config.mk.in template, and
      configure script so that object files corresponding to source files
      belonging to the BLAS compatibility layer are not compiled (or archived)
      when the compatibility layer is disabled. (Same for CBLAS.) Thanks
      to Devin Matthews for suggesting this optimization.
    - Slight change to the way configure handles internal variables. Instead
      of converting (overwriting) some, such as enable_blas2blis and
      enable_cblas, from a "yes" or "no" to a "1" or "0" value, the latter are
      now stored in new variables that live alongside the originals (with the
      suffix "_01").  This is convenient since some values need to be
      sed-substituted into the config.mk.in template, which requires "yes" or
      "no", while some need to be written to the bli_config.h.in template,
      which requires "0" or "1".

commit 22dd6a353ddb56614309c01533b1a94c9fd32bca
Merge: cdfb3c3f f20ed388
Author: praveeng <praveen.g@amd.com>
Date:   Tue Aug 23 15:15:35 2016 +0530

    Merge master code as on 2016_08_23 to amd-staging branch by praveeng
    
     Changes to be committed:
            modified:   frame/thread/bli_mutex_openmp.h
            modified:   frame/thread/bli_mutex_pthreads.h
    
    Change-Id: Ica522edbb1d0173f53f38d5057b1f7aef73666be

commit c6f5c215ee793d03ea834469fc2adc53feaffc42
Merge: d52cb767 16a4c7a8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Aug 22 17:33:02 2016 -0500

    Merge branch 'master' into compose

commit f20ed3885d628992fab88690f629a5a2bab3eb88
Merge: 02ac597e 4bc842ca
Author: praveeng <praveen.g@amd.com>
Date:   Mon Aug 22 15:27:33 2016 +0530

    Merge branch 'master' of https://github.com/clMathLibraries/blis-amd for "Fixed bugs in bli_mutex_init() and friends."

commit 02ac597e4b9be2670d9fff65d28552f8e1ec81b3
Author: praveeng <praveen.g@amd.com>
Date:   Thu Jul 28 15:11:08 2016 +0530

    Revert commits 357c990bdd7bd5667aac5adf1bab3712973e7414
    
    Change-Id: I12a34456d7eed93fda4369e76bcddb42ba7ccb99

commit 84e41cc73c9c87ce64582acd4264b8e1b5316482
Author: praveeng <praveen.g@amd.com>
Date:   Thu Jul 28 15:01:36 2016 +0530

    Revert commits 8aee306
    
    Change-Id: I3dd999c77c6779332a40dbb84371ca487216f189

commit 30ccfcee82db93d0109d1571242e2db925e95d0a
Author: praveeng <praveen.g@amd.com>
Date:   Mon Jul 25 14:14:00 2016 +0530

    removed changes from readme file which are giving conflicts
    
    Change-Id: Ic71ad1313e1404fed444e899466043704d875af6

commit aeca25cd63fc8971f8fe7809599c57853f976548
Author: praveeng <praveen.g@amd.com>
Date:   Tue Jul 5 16:51:23 2016 +0530

    first commit
    
    Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2

commit 6b2274864b36fd1019d97bcc4ca6dd7a57ef16d9
Author: praveeng <praveen.g@amd.com>
Date:   Tue Jul 5 15:00:31 2016 +0530

    small modification to readme for  git push test
    
    Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a

commit daa7a9ecb25982f2551adbd95e65f8ba97cfe944
Author: praveeng <praveen.g@amd.com>
Date:   Tue Jul 5 16:51:23 2016 +0530

    first commit
    
    Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2

commit 5f66a4aa05aeffcb6eb587851d78d9527319466c
Author: praveeng <praveen.g@amd.com>
Date:   Tue Jul 5 15:00:31 2016 +0530

    small modification to readme for  git push test
    
    Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a

commit c6cbd78d2388c08824822b91a1c36ac4349bb67f
Author: praveeng <praveen.g@amd.com>
Date:   Thu Jul 28 15:11:08 2016 +0530

    Revert commits 357c990bdd7bd5667aac5adf1bab3712973e7414
    
    Change-Id: I12a34456d7eed93fda4369e76bcddb42ba7ccb99

commit 9219a9060762525f87ebbf556d78fe8621858513
Author: praveeng <praveen.g@amd.com>
Date:   Thu Jul 28 15:01:36 2016 +0530

    Revert commits 8aee306
    
    Change-Id: I3dd999c77c6779332a40dbb84371ca487216f189

commit 728573296efa7cf14d2381570e116509dfe2a240
Author: praveeng <praveen.g@amd.com>
Date:   Mon Jul 25 14:14:00 2016 +0530

    removed changes from readme file which are giving conflicts
    
    Change-Id: Ic71ad1313e1404fed444e899466043704d875af6

commit ad7862e291c240505c733a41d231b1a126ade73c
Author: praveeng <praveen.g@amd.com>
Date:   Tue Jul 5 16:51:23 2016 +0530

    first commit
    
    Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2

commit ad4b471a25ce77867295e5529dfc787e7c18b03f
Author: praveeng <praveen.g@amd.com>
Date:   Tue Jul 5 15:00:31 2016 +0530

    small modification to readme for  git push test
    
    Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a

commit 55d641363fcd8bdfdabbd7c22822fa2d0b7f3fa6
Author: praveeng <praveen.g@amd.com>
Date:   Tue Jul 5 16:51:23 2016 +0530

    first commit
    
    Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2

commit f3b6b15f6d591d323802bd6c81c522a02056506d
Author: praveeng <praveen.g@amd.com>
Date:   Tue Jul 5 15:00:31 2016 +0530

    small modification to readme for  git push test
    
    Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a

commit 16a4c7a823d60707ed9272f5d36e5c5d54c0ba4b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Aug 19 11:38:36 2016 -0500

    Fixed bugs in bli_mutex_init() and friends.
    
    Details:
    - Fixed a couple of bugs that affected OpenMP and POSIX threads
      configurations that resulted in compiler errors and warnings due
      to type mismatch, and in the case of pthreads, a missing function
      argument. The bugs are fairly recent, introduced in a017062.

commit c8e4ef93953ba2b79fb7e0973c08469c0e28a2cd
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Aug 3 16:13:03 2016 -0500

    Add prefetchw to 30x8 kernel.

commit 4b5a2f3d6e7ffeb5cc2be8448554f5c2083ad68f
Merge: 380736bf 9f52a587
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Aug 3 16:09:51 2016 -0500

    Merge remote-tracking branch 'origin/knl' into knl
    
    # Conflicts:
    #       kernels/x86_64/knl/3/bli_dgemm_opt_24x8.c

commit 380736bfe955efbdd7274c90b6fd635688e83bc4
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Aug 3 16:08:28 2016 -0500

    Add (new) 30x8 KNL kernel and fix non-scatter prefetch bug.

commit 9f52a587dee855daa73c194e41b6951416544e9a
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Aug 3 16:03:53 2016 -0500

    Try prefetchw[t1] instead of regular prefetch for C.

commit 8945a1512d366bc6a8a85718d12cbf5de6f2898b
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Aug 3 11:28:24 2016 -0500

    This version gets ~1550 GFLOPs on KNL with 16x4.

commit cdfb3c3f29d321033fca106aa58ab67ead90a95d
Merge: 50a2f2ef 4bc842ca
Author: praveeng <praveen.g@amd.com>
Date:   Fri Jul 29 12:45:04 2016 +0530

    Merge master code as on 2016_07_29 to amd-staging branch by praveeng
    
    Change-Id: Ic78b84d8b8d10158fb2a612f9a64bbc7b1f9b486

commit 4bc842ca3a64e658c0808bfe4c5693a5ace97923
Merge: 117f8838 b0d510bf
Author: praveeng <praveen.g@amd.com>
Date:   Thu Jul 28 17:32:12 2016 +0530

    Merge branch 'master' of  publicrepo

commit 117f8838511a478aa16137e770d27dd21f4227c5
Author: praveeng <praveen.g@amd.com>
Date:   Thu Jul 28 15:11:08 2016 +0530

    Revert commits 357c990bdd7bd5667aac5adf1bab3712973e7414
    
    Change-Id: I12a34456d7eed93fda4369e76bcddb42ba7ccb99

commit 2fcdc28f1055d385b2e662aa920fb97c472394d7
Author: praveeng <praveen.g@amd.com>
Date:   Thu Jul 28 15:01:36 2016 +0530

    Revert commits 8aee306
    
    Change-Id: I3dd999c77c6779332a40dbb84371ca487216f189

commit 1b5d104afe0628b8b6c0650f1e58cfb08be67004
Author: praveeng <praveen.g@amd.com>
Date:   Mon Jul 25 14:14:00 2016 +0530

    removed changes from readme file which are giving conflicts
    
    Change-Id: Ic71ad1313e1404fed444e899466043704d875af6

commit d81273047bff56501e9413a90991d3d1f8b56a06
Author: praveeng <praveen.g@amd.com>
Date:   Tue Jul 5 16:51:23 2016 +0530

    first commit
    
    Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2

commit 65905c3011a11cda95761681d4ae84337e46bdb5
Author: praveeng <praveen.g@amd.com>
Date:   Tue Jul 5 15:00:31 2016 +0530

    small modification to readme for  git push test
    
    Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a

commit 23cca231be10fe1797aed451bcbc69d38c78bc0c
Author: praveeng <praveen.g@amd.com>
Date:   Tue Jul 5 16:51:23 2016 +0530

    first commit
    
    Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2

commit 922e3091702f25e3287b417719a33adbd5bbf138
Author: praveeng <praveen.g@amd.com>
Date:   Tue Jul 5 15:00:31 2016 +0530

    small modification to readme for  git push test
    
    Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a

commit b0d510bf0e4dfd177f9e4ae0069f41921e2ecdc1
Author: praveeng <praveen.g@amd.com>
Date:   Thu Jul 28 15:11:08 2016 +0530

    Revert commits 357c990bdd7bd5667aac5adf1bab3712973e7414
    
    Change-Id: I12a34456d7eed93fda4369e76bcddb42ba7ccb99

commit 5ebeece5b4a8df81d59ca7558b278a4263d15128
Author: praveeng <praveen.g@amd.com>
Date:   Thu Jul 28 15:01:36 2016 +0530

    Revert commits 8aee306
    
    Change-Id: I3dd999c77c6779332a40dbb84371ca487216f189

commit 6ce4c022ebdea00c2b951090e3c2e9e88735b9ce
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Jul 27 16:26:36 2016 -0500

    Switch back to 24x8. I could only squeeze 24.5GFLOP out of 8x24, and scalability is not improved.

commit d52cb7671509592a8078729477b40b60380518a2
Merge: 95abea46 c31b1e7b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 27 16:04:55 2016 -0500

    Merge branch 'master' into compose

commit c31b1e7b9d659b96433a87e5aecb90e457a104cc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 27 15:58:07 2016 -0500

    Relax alignment restrictions for sandybridge ukrs.
    
    Details:
    - Relaxed the base pointer and leading dimension alignment restrictions
      in the sandybridge gemm microkernels, allowing the use of vmovups/vmovupd
      instead of vmovaps/vmovapd. These changes mimic those made to the haswell
      microkernels in e0d2fa0 and ee2c139.
    - Updated testsuite modules as well as standalone test drivers in 'test'
      directory to use DBL_MAX as the initial time candidate. Thanks to Devin
      Matthews for suggesting this change.
    - Inserted #include "float.h" into bli_system.h (to gain access to DBL_MAX).
    - Minor update (vis-a-vis contexts) to driver code in test/3m4m.
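    
    The role of DBL_MAX as the initial time candidate can be sketched as
    follows. The function best_time() is hypothetical and takes
    already-measured timings as input, whereas the actual test drivers
    time each repeat inside their own loops.
    
    ```c
    #include <float.h>
    
    /* Returns the smallest of n timing samples. Starting from DBL_MAX
       means the first real sample always replaces the initial candidate,
       no matter how long that run took. With n == 0, DBL_MAX is returned
       unchanged. */
    double best_time( const double* samples, int n )
    {
        double best = DBL_MAX;
    
        for ( int i = 0; i < n; ++i )
            if ( samples[ i ] < best ) best = samples[ i ];
    
        return best;
    }
    ```
    
    A fixed starting value would silently cap the reported time whenever
    every repeat ran longer than that value; DBL_MAX avoids that.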

commit b8f2b55532849d45d379afbdd05a52ff6100800d
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Jul 27 15:22:55 2016 -0500

    Try an 8x24 kernel for the hell of it.

commit 7ede5863ae3567f7c0852efc2d5cd649ca19e0f3
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Jul 27 13:41:27 2016 -0600

    Allocate pack buffer on MCDRAM for KNL.

commit ad89ed2e829c7b261d8ba0998a3cb83ad576ee04
Merge: 2c9de740 81e2b05f
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Jul 27 11:45:40 2016 -0500

    Merge branch 'knl' of github.com:devinamatthews/blis into knl

commit 2c9de740edb66c4692c200731763bbd1d3171ccb
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Jul 27 11:44:54 2016 -0500

     This version gets ~26GF on one core.

commit 81e2b05f31bca4e1e1676e7b533d1868d9f9be33
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Jul 27 11:39:05 2016 -0500

    Add optimized packing kernels for KNL.

commit a7d8ca97b8d835c32d90ff20a565c82733f014a8
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Jul 25 15:15:13 2016 -0500

    All fixed.

commit 963d0393b023f4134bb0c682923faf9964c0e645
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Jul 25 14:40:53 2016 -0500

    Add 24xk pack kernel.

commit 117b76739afba481768897d2580f8365d3345417
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Jul 25 13:53:07 2016 -0500

    In the midst of debugging.

commit 8c0a4fd1d3535d608a9a309a61ffee0a73c3646f
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Jul 25 13:09:24 2016 -0500

    Fix some row/column confusion.

commit c44f9f96930312125b15e64c326ab5ab5cc02633
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Jul 25 12:02:24 2016 -0500

    Simplify displacements -- clang assembler was badly botching EVEX compressed displacements giving false alarms for instruction length.

commit e0cce177cc1b47ec9f11ac0556241feaa3564df1
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Jul 25 10:02:25 2016 -0500

    Minor fixes for 8x24 KNL kernel.

commit 50a2f2efcbeb46537f1deaa8e44dc579a4e49eb8
Merge: 1aa77dfc cfd46c88
Author: praveeng <praveen.g@amd.com>
Date:   Mon Jul 25 17:01:20 2016 +0530

    Merge master code as on 2016_07_25 to amd-staging branch by praveeng
    
    Change-Id: I84886ae241db2aac0bef6b7ef399f04aa8bca16d

commit cfd46c88d59c8f61d5e7cf768d606e4c44623584
Merge: f493bf4d a017062f
Author: praveeng <praveen.g@amd.com>
Date:   Mon Jul 25 15:38:13 2016 +0530

    Merge remote-tracking branch 'publicrepo/master'

commit f493bf4d704fe0e967783cd6e6877d3302c056a1
Author: praveeng <praveen.g@amd.com>
Date:   Mon Jul 25 14:14:00 2016 +0530

    removed changes from readme file which are giving conflicts
    
    Change-Id: Ic71ad1313e1404fed444e899466043704d875af6

commit 65735bbedf75784c48bd11e05b3fdc98fc66b4bc
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Sun Jul 24 21:50:32 2016 -0500

    Switch to 24x8 kernel, unrolled by 16.

commit 45d5dc97177117220bd9dd0abf85aafc185acad1
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Sun Jul 24 14:25:26 2016 -0500

    Add 24x8 "KNC-style" kernel for KNL.

commit 95abea46f86816fddfc9ff0abfa52880801461be
Merge: d0dfe5b5 a017062f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jul 23 15:38:33 2016 -0500

    Merge branch 'master' into compose

commit a017062fdf763037da9d971a028bb07d47aa1c8a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 22 17:02:59 2016 -0500

    Integrated "memory broker" (membrk_t) abstraction.
    
    Details:
    - Integrated a patch originally authored and submitted by Ricardo Magana
      of HP Enterprise. The changeset inserts use of a new object type, membrk_t,
      (memory broker) that allows multiple sets of memory pools on, for example,
      separate NUMA nodes, each of which has a separate memory space.
    - Added membrk field to cntx_t and defined corresponding accessor macros.
    - Added membrk field to mem_t object and defined corresponding accessor macros.
    - Created new bli_membrk.c file, which contains the new memory broker API,
      including:
        bli_membrk_init(), bli_membrk_finalize()
        bli_membrk_acquire_[mv](), bli_membrk_release(),
        bli_membrk_init_pools(), bli_membrk_reinit_pools(),
        bli_membrk_finalize_pools(),
        bli_membrk_pool_size()
    - In bli_mem.c, changed function calls to
        bli_mem_init_pools()     -> bli_membrk_init()
        bli_mem_reinit_pools()   -> bli_membrk_reinit()
        bli_mem_finalize_pools() -> bli_membrk_finalize()
    - In bli_packv_init.c, bli_packm_init.c, changed function calls to:
        bli_mem_acquire_[mv]() -> bli_membrk_acquire_[mv]()
        bli_mem_release()      -> bli_membrk_release()
    - Added bli_mutex.c and related files to frame/thread. These files define
      abstract mutexes (locks) and corresponding APIs for pthreads, openmp, or
      single-threaded execution. This new API is employed within functions
      such as bli_membrk_acquire_[mv]() and bli_membrk_release().

commit 8ff2e069c48c12fd06b9c48c6b3aeb4ea9b0e6e1
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 22 16:22:26 2016 -0500

    Add 4x unrolled variant for KNL microkernel.

commit 9cb2ed9b0c25f31a22c1c9719b062fa665ad7adf
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 22 16:10:30 2016 -0500

    Get rid of one RBX update.

commit 451bde076f0320d60cd2475cfb048ac4a2b798bb
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 22 15:43:00 2016 -0500

    Add some more knobs to twiddle for KNL microkernel.

commit 8c6e621c099521e7a4d87e007bb8224faa5f33a3
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 22 15:05:15 2016 -0500

    Make knl conform to new kernel dir structure.

commit ce7214c6618d6f22f4ce2ee452336236916d1f30
Merge: 119d0399 ce59f811
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 22 14:59:53 2016 -0500

    Merge remote-tracking branch 'origin/master' into knl

commit ce59f81108ec9aea918a7e77030da8acfdd397ce
Merge: ff41153f 707a2b7f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 22 14:48:14 2016 -0500

    Merge pull request #88 from devinamatthews/32bit-dim_t
    
    Handle 32-bit dim_t in 64-bit microkernels.

commit 707a2b7faca137cca7cab7b11a12c44ddaf7ad53
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 22 13:49:44 2016 -0500

    Somehow forgot the most important microkernel.

commit 47ec045056351ac4f0791c071fa0daaa81699c8c
Merge: 08f1d6b6 ff41153f
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 22 13:45:23 2016 -0500

    Merge remote-tracking branch 'upstream/master' into 32bit-dim_t

commit 08f1d6b6fa344275de0f675f69737145ccf6646a
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 22 13:44:37 2016 -0500

    Use 64-bit intermediate variable for k for architectures that do 64-bit loads in case dim_t is 32-bit.
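    
    A minimal sketch of the idea, assuming a build where dim_t is 32 bits
    wide. The typedef and the helper name widen_k are illustrative and are
    not taken from the BLIS sources:

```c
#include <stdint.h>

typedef int32_t dim_t;  /* assumption: dim_t configured as a 32-bit integer */

/* Copy k into a 64-bit variable so that code performing an 8-byte load
   reads a fully defined value, rather than the four bytes of k followed
   by four bytes of whatever happens to sit next to it in memory. */
uint64_t widen_k(dim_t k)
{
    uint64_t k64 = (uint64_t)k;
    return k64;
}
```

    A microkernel would then hand the address of the 64-bit copy, not of
    the original dim_t, to its inline assembly.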

commit ff41153f4eb7f38ed94bdd9a3fd81fb979f3f401
Merge: f9214ced e0d2fa0d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 22 13:21:03 2016 -0500

    Merge pull request #86 from devinamatthews/haswell-vmovups
    
    Remove alignment restrictions on C in haswell kernel.

commit e0d2fa0d835ab49366aeb790363bb2b571d36ed8
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 22 12:56:51 2016 -0500

    Relax alignment restrictions for haswell sgemm.

commit f9214ced97392861f5a0ea72abfcf6f41faf674c
Merge: 413d62ac 08666eaa
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 22 12:16:39 2016 -0500

    Merge pull request #85 from devinamatthews/qopenmp
    
    Change -openmp to -fopenmp for icc.

commit ee2c139df6ad53c6aec8a67ab23b3b1912e8d259
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 22 12:06:03 2016 -0500

    Remove alignment restrictions on C in haswell kernel.

commit 08666eaa20d8a31f2f92f944e5bfa7c1558c53e4
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 22 11:07:34 2016 -0500

    Change -openmp to -fopenmp for icc.

commit 119d0399428905053265f3aca1cc8cc1fde3b363
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 22 10:23:31 2016 -0500

    Add 8x24 KNL kernel.

commit 1aa77dfc1dc183d16e0b6a1196d9c263f021e83d
Merge: 9101a9c8 ec9f5983
Author: praveeng <praveen.g@amd.com>
Date:   Thu Jul 21 14:22:40 2016 +0530

    Merge master code as on 2016_07_21 to amd-staging branch by praveeng
    
    Change-Id: Ic7d0a21101358f08147736e7f1884e7409937344

commit b58cda9eba0c1e175460aae109baf792d29ba5bf
Merge: 318f063d 413d62ac
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Jul 19 14:09:09 2016 -0500

    Merge remote-tracking branch 'origin/master' into knl
    
    # Conflicts:
    #       frame/base/bli_threading.h
    #       frame/include/blis.h
    #       frame/thread/bli_thread.c

commit ec9f59836b32260c29ff1cd24e629c7d8de14992
Merge: 197e182f 763babe4
Author: praveeng <praveen.g@amd.com>
Date:   Mon Jul 18 12:56:25 2016 +0530

    Merge branch 'master' of https://github.com/clMathLibraries/blis-amd

commit 197e182fcbf1340fd4a202fac58bea6cfcfa9e2f
Author: praveeng <praveen.g@amd.com>
Date:   Tue Jul 5 16:51:23 2016 +0530

    first commit
    
    Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2

commit 41fb32711031e7ec86b062aa7f53255d1f5905e2
Author: praveeng <praveen.g@amd.com>
Date:   Tue Jul 5 15:00:31 2016 +0530

    small modification to readme for git push test
    
    Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a

commit d0dfe5b5372cc7558ee9c4104b29f82eecc7ed61
Merge: 31def12e 413d62ac
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 14 11:01:06 2016 -0500

    Merge branch 'master' into compose

commit 9101a9c880e3934f8a63ffc7fe15f5fc1077a73d
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Wed Jul 13 16:51:14 2016 +0530

    Checked in optimized 1V kernels along with benchmark codes. Also incorporated review comments for 1F kernels
    
    Change-Id: I035c0d39e6b0bed28e6e2041242186c49f6ed55b

commit 763babe488880b42c86c7fc207aa7665bd0ff9f7
Merge: 357c990b 413d62ac
Author: praveeng <praveen.g@amd.com>
Date:   Wed Jul 13 11:57:19 2016 +0530

    Merge remote-tracking branch 'publirepo/master'

commit 413d62aca28edabba56605a9f87d5b715831e1db
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 12 15:02:52 2016 -0500

    README update (use official ACM TOMS links).

commit dfa431f696db2df4065ea454df268a2e0bc02eac
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 12 14:21:19 2016 -0500

    README update (BLIS2 TOMS article now in-print).

commit 357c990bdd7bd5667aac5adf1bab3712973e7414
Author: praveeng <praveen.g@amd.com>
Date:   Tue Jul 5 16:51:23 2016 +0530

    first commit
    
    Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2

commit 8aee306300adb099b66036f2c2f7f3996433cf49
Author: praveeng <praveen.g@amd.com>
Date:   Tue Jul 5 15:00:31 2016 +0530

    small modification to readme for git push test
    
    Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a

commit 31def12e2629f187e40f93f6bae9e26a6c2660e2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 30 15:19:20 2016 -0500

    First phase of control tree redesign.
    
    Details:
    - These changes constitute the first set of changes in preparation to
      revamping the structure and use of control trees in BLIS. Modifications
      in this commit don't affect the control tree code yet, but rather lay
      the groundwork.
    - Defined wrappers for the following functions, where the wrappers
      each take a direction parameter of a new enumerated type, dir_t
      (BLIS_BWD or BLIS_FWD), and execute the correct underlying function.
      - bli_acquire_mpart_*() and _vpart_*()
      - bli_*_determine_kc_[fb]()
      - bli_thread_get_range_*() and bli_thread_get_range_weighted_*()
    - Consolidated all 'f' (forwards-moving) and 'b' (backwards-moving)
      blocked variants for trmm and trsm, and renamed gemm and herk variants
      accordingly. The direction is now queried via routines such as
      bli_trmm_direct(), which determines the direction from the implied side
      and uplo parameters. For gemm and herk, it is unconditionally BLIS_FWD.
    - Defined wrappers to parameter-specific macrokernels for herk, trmm, and
      trsm, e.g. bli_trmm_xx_ker_var2(), that execute the correct underlying
      macrokernel based on the implied parameters. The same logic used to
      choose the dir_t in _direct() functions is used here.
    - Simplified the function pointer arrays in _int() functions given the
      consolidation and dir_t querying mentioned above.
    - Function signature (whitespace) reformatting for various functions.
    - Removed old code in various 'old' directories.
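    
    The direction-taking wrapper pattern described above can be sketched
    roughly as follows. Only dir_t, BLIS_FWD, and BLIS_BWD come from this
    commit message; the enumerator order and the block_start* helpers are
    hypothetical stand-ins for the real partitioning functions:

```c
typedef enum { BLIS_FWD, BLIS_BWD } dir_t;  /* enumerator order is an assumption */

/* Hypothetical direction-specific helpers: starting offset of the i-th
   block of size b when sweeping a dimension of length n. */
int block_start_f(int i, int b, int n)
{
    (void)n;                       /* forward sweeps do not need n */
    return i * b;
}

int block_start_b(int i, int b, int n)
{
    int start = n - (i + 1) * b;   /* count blocks from the far end */
    return start < 0 ? 0 : start;  /* the last block may be partial */
}

/* Wrapper: callers pass a dir_t instead of choosing between the
   _f and _b functions themselves. */
int block_start(dir_t dir, int i, int b, int n)
{
    return dir == BLIS_FWD ? block_start_f(i, b, n)
                           : block_start_b(i, b, n);
}
```

    With a wrapper of this shape, a single blocked variant can serve both
    sweep directions, which is what makes consolidating the 'f' and 'b'
    variants possible.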

commit 405c9d46344d93c3eab5572b233900b50ca50d68
Author: sthangar <Santanu.Thangaraj@amd.com>
Date:   Wed Jun 22 12:18:54 2016 +0530

    Check-in the fused kernels optimized for Zen
    
    Change-Id: I7b2f467b960e7b9a285f06e47be87de122e5fa24

commit 232754feecf29452987666b9f5ebba2619bfd0b0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jun 21 14:25:39 2016 -0500

    Fixed compiler warning in rand[vm], randn[vm].
    
    Details:
    - Fixed compiler warnings about unused variables related to the disabling
      of normalization in the structured cases of the rand[vm] and randn[vm]
      operations.

commit a89555d1605574f3685813dcc972b636dd61264d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jun 17 14:08:35 2016 -0500

    Added randn[vm] operations, support in testsuite.
    
    Details:
    - Defined a new randomization operation, randn, on vectors and matrices.
      The randnv and randnm operations randomize each element of the target
      object with values from a narrow range of values. Presently, those
      values are all integer powers of two, but they do not need to be powers
      of two in order to achieve the primary goal, which is to initialize
      objects that can be operated on with plenty of precision "slack"
      available to allow computations that avoid roundoff. Using this method
      of randomization makes it much more likely that testsuite residuals of
      properly-functioning operations are close to zero, if not exactly zero.
    - Updated existing randomization operations randv and randm to skip
      special diagonal handling and normalization for matrices with structure.
      This is now handled by the testsuite modules by explicitly calling a
      testsuite function that loads the diagonal (and scales off-diagonal
      elements).
    - Added support for randnv and randnm in the testsuite with a new switch
      in input.general that universally toggles between use of the classic
      randv/randm, which use real values on the interval [-1,1], and
      randnv/randnm, which use only values from a narrow range. Currently,
      the narrow range is: +/-{2^0, 2^-1, 2^-2, 2^-3, 2^-4, 2^-5, 2^-6}, as
      well as 0.0.
    - Updated testsuite modules so that a testsuite wrapper function is called
      instead of directly calling the randomization operations (such as
      bli_randv() and bli_randm()). This wrapper also takes a bool_t that
      indicates whether the object's elements should be normalized. (NOTE: As
      alluded to above, in the test modules of triangular solve operations such
      as trsv and trsm, we perform the extra step of loading the diagonal.)
    - Defined a new level-0 operation, invertsc, which inverts a scalar.
    - Updated the abval2ris and sqrt2ris level-0 macros to avoid an unlikely
      but possible divide-by-zero.
    - Updated function signature and prototype formatting in testsuite.
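    
    To illustrate why the narrow range helps, here is a small sketch that
    maps an integer onto one of the 15 values listed above (0.0 and
    +/-2^0 through +/-2^-6). The function name and the particular mapping
    are assumptions for illustration, not the randnv/randnm implementation:

```c
/* Map an arbitrary integer onto 0.0 or +/-{2^0, 2^-1, ..., 2^-6}. */
double narrow_value(unsigned r)
{
    unsigned slot = r % 15u;
    if (slot == 0u) return 0.0;

    unsigned idx = slot - 1u;                 /* 0..13 */
    unsigned e   = idx >> 1;                  /* exponent magnitude, 0..6 */
    double   mag = 1.0 / (double)(1u << e);   /* exactly 2^-e */

    return (idx & 1u) ? -mag : mag;           /* odd idx selects the negative value */
}
```

    Products of such values are again powers of two with small exponents,
    so they are computed without roundoff, and short sums of them usually
    are as well.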

commit 318f063dcbd8b594969e401bc99146d24b01066a
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Jun 8 17:46:50 2016 -0500

    Add new KNL microkernel derived from Haswell.

commit 096895c5d538a7f8817603d7cf28c52e99340def
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 6 13:32:04 2016 -0500

    Reorganized code, APIs related to multithreading.
    
    Details:
    - Reorganized code and renamed files defining APIs related to multithreading.
      All code that is not specific to a particular operation is now located in a
      new directory: frame/thread. Code is now organized, roughly, by the
      namespace to which it belongs (see below).
    - Consolidated all operation-specific *_thrinfo_t object types into a single
      thrinfo_t object type. Operation-specific level-3 *_thrinfo_t APIs were
      also consolidated, leaving bli_l3_thrinfo_*() and bli_packm_thrinfo_*()
      functions (aside from a few general purpose bli_thrinfo_*() functions).
    - Renamed thread_comm_t object type to thrcomm_t.
    - Renamed many of the routines and functions (and macros) for multithreading.
      We now have the following API namespaces:
      - bli_thrinfo_*(): functions related to thrinfo_t objects
      - bli_thrcomm_*(): functions related to thrcomm_t objects.
      - bli_thread_*(): general-purpose functions, such as initialization,
        finalization, and computing ranges. (For now, some macros, such as
        bli_thread_[io]broadcast() and bli_thread_[io]barrier() use the
        bli_thread_ namespace prefix, even though bli_thrinfo_ may be more
        appropriate.)
    - Renamed thread-related macros so that they use a bli_ prefix.
    - Renamed control tree-related macros so that they use a bli_ prefix (to be
      consistent with the thread-related macros that were also renamed).
    - Removed #undef BLIS_SIMD_ALIGN_SIZE from dunnington's bli_kernel.h. This
      #undef was a temporary fix to some macro defaults which were being applied
      in the wrong order, which was recently fixed.

commit 232530e88ff99f37abcae5b6fb5319a9a375a45f
Merge: 4bcabd1b eef37f8b
Author: Tyler Michael Smith <tms@cs.utexas.edu>
Date:   Wed Jun 1 15:14:10 2016 -0500

    Merge commit 'refs/pull/81/head' of https://github.com/flame/blis
    
    Conflicts:
            frame/base/bli_threading_pthreads.c
            frame/base/bli_threading_pthreads.h

commit 4bcabd1bf60688c38cf562459fc5e8be8b831756
Author: Tyler Michael Smith <tms@cs.utexas.edu>
Date:   Wed Jun 1 13:27:28 2016 -0500

    Use spin locks instead of pthread barriers

commit eef37f8b4d81845a6ba4bf25586d32b50c3e8a68
Author: Jeff Hammond <jeff.science@gmail.com>
Date:   Sun May 29 22:28:13 2016 -0700

    use GCC intrinsic instead of pthread_mutex for atomic increment and fetch
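    
    A sketch of what such a replacement can look like.
    __sync_add_and_fetch is a real GCC/Clang builtin, but that it is the
    specific intrinsic used in this commit is an assumption, and the
    function name is hypothetical:

```c
/* Atomically add 1 to *counter and return the new value, with no
   mutex lock/unlock pair around the update. */
int increment_and_fetch(int *counter)
{
    return __sync_add_and_fetch(counter, 1);
}
```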

commit 9dcd6f05c4c3ff2ce7cd87a9951a96ebef22681e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 24 13:15:32 2016 -0500

    Implemented developer-configurable malloc()/free().
    
    Details:
    - Replaced all instances of bli_malloc() and bli_free() with one of:
      - bli_malloc_pool()/bli_free_pool()
      - bli_malloc_user()/bli_free_user()
      - bli_malloc_intl()/bli_free_intl()
      each of which can be configured to call malloc()/free() substitutes,
      so long as the substitute functions have the same function type
      signatures as malloc() and free() defined by C's stdlib.h. The _pool()
      function is called when allocating blocks for the memory pools (used
      for packing buffers, primarily), the _user() function is called when
      obj_t's are created (via bli_obj_create() and friends), and the _intl()
      function is called for internal use by BLIS, such as when creating
      control tree nodes or temporary buffers for manipulating internal data
      structures. Substitutes for any of the three types of bli_malloc() may
      be specified by #defining the following pairs of cpp macros in
      bli_kernel.h:
      - BLIS_MALLOC_POOL/BLIS_FREE_POOL
      - BLIS_MALLOC_USER/BLIS_FREE_USER
      - BLIS_MALLOC_INTL/BLIS_FREE_INTL
      to be the name of the substitute functions. (Obviously, the object
      code that contains these functions must be provided at link-time.)
      These macros default to malloc() and free(). Substitute functions are
      also automatically prototyped by BLIS (in bli_malloc_prototypes.h).
    - Removed definitions for bli_malloc() and bli_free().
    - Note that bli_malloc_pool() and bli_malloc_user() are now defined in
      terms of a new function, bli_malloc_align(), which aligns memory to an
      arbitrary (power of two) alignment boundary, but does so manually,
      whereas before alignment was performed behind the scenes by
      posix_memalign(). Currently, bli_malloc_intl() is defined in terms
      of bli_malloc_noalign(), which serves as a simple wrapper to the
      designated function that is passed in (e.g. BLIS_MALLOC_INTL).
      Similarly, there are bli_free_align() and bli_free_noalign(), which
      are used in concert with their bli_malloc_*() counterparts.
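    
    One common way to align memory manually on top of a caller-supplied
    malloc()-like function is sketched below. This shows the general
    technique only and may differ from how bli_malloc_align() is written;
    the names malloc_align_sketch and free_align_sketch are hypothetical:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Allocate size bytes aligned to align (a power of two) using malloc_fp. */
void *malloc_align_sketch(size_t size, size_t align, void *(*malloc_fp)(size_t))
{
    /* Room for the payload, worst-case padding, and the saved pointer. */
    void *raw = malloc_fp(size + align + sizeof(void *));
    if (raw == NULL) return NULL;

    /* Skip past the slot for the saved pointer, then round up to the
       next multiple of align. */
    uintptr_t addr    = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (addr + align - 1) & ~(uintptr_t)(align - 1);

    /* Stash the original pointer just below the aligned address. */
    memcpy((void *)(aligned - sizeof(void *)), &raw, sizeof(void *));
    return (void *)aligned;
}

/* Release memory obtained from malloc_align_sketch() using free_fp. */
void free_align_sketch(void *p, void (*free_fp)(void *))
{
    void *raw;
    if (p == NULL) return;

    /* Recover the pointer that the underlying allocator returned. */
    memcpy(&raw, (char *)p - sizeof(void *), sizeof(void *));
    free_fp(raw);
}
```

    Because the original pointer is stored next to the aligned block, the
    matching free routine needs no separate bookkeeping, which is why the
    aligned and unaligned malloc/free functions must be used in pairs.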

commit 9dd440109a9d964f5cd286e9f83c487ad703e1e4
Author: Jeff Hammond <jeff.science@gmail.com>
Date:   Sat May 21 15:21:58 2016 -0700

    fix 404 link to BuildSystem
    
    Google Code is dead.  Long live GitHub!

commit d309f20b7376a68efa3b864ad790c2021c071655
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 18 15:13:53 2016 -0500

    Added alignment switch to testsuite.
    
    Details:
    - Added a new input parameter to input.general that globally toggles
      whether testsuite tests are performed on objects whose buffers and
      leading dimensions have been aligned, and changed the implementation
      of libblis_test_mobj_create() to employ alignment (or not) regardless
      of whether row, column, or general storage is being tested.
    - Updated configure script's "--help" text to indicate default behavior
      for internal integer type size and BLAS/CBLAS integer type size
      options.

commit 32db0adc218ea4ae370164dbe8d23b41cd3526d3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 17 15:20:16 2016 -0500

    Generate prototypes for user-defined packm kernels.
    
    Details:
    - Created template prototypes for packm kernels (in bli_l1m_ker.h), and
      then redefined reference packm kernels' prototyping headers in terms of
      this template, as is already done for level-1v, -1f, and -3 kernels.
    - Automatically generate prototypes for user-defined packm kernels in
      bli_kernel_prototypes.h (using the new template prototypes in
      bli_l1m_ker.h).
    - Defined packm kernel function types in bli_l1m_ft.h, including for
      packm kernels specific to induced methods, which are now used in
      bli_packm_cxk.c and friends rather than using a locally-defined
      function type.
    - In bli_packm_cxk.c, extended the function pointer array for packm
      kernels out to index 31 (from previous maximum of 17). This allows us to
      store the unrolled 30xk kernel in the array for use (on knc, for
      example). Note: This should have been done a long time ago.
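    
    A rough sketch of a kernel table indexed by panel dimension, as
    described in the last item above. The type, array, and function names
    are hypothetical, and the kernel signature is reduced to void(void)
    for brevity:

```c
#include <stddef.h>

typedef void (*packm_ker_ft)(void);   /* hypothetical, simplified kernel type */

void packm_30xk_sketch(void) { /* an optimized 30xk kernel would go here */ }

#define MAX_PANEL_DIM 31

/* Indexed by panel dimension; NULL means "no specialized kernel". */
static packm_ker_ft packm_kers[MAX_PANEL_DIM + 1] = {
    [30] = packm_30xk_sketch
};

/* Return the specialized kernel for panel_dim, or NULL if none exists. */
packm_ker_ft packm_ker_lookup(size_t panel_dim)
{
    return panel_dim <= MAX_PANEL_DIM ? packm_kers[panel_dim] : NULL;
}
```

    With a maximum index of 17, a 30xk kernel has no slot in such a table,
    so the lookup could never return it.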

commit e3bd5ca64ae7c190ba689396c0de687b829a11fe
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu May 12 20:54:13 2016 -0500

    Fix SIMD definitions in KNL config, and a couple of fixes to C update.

commit 4fe02e3d497995d94d34d3fcf5af895084cfc8b9
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu May 12 20:53:58 2016 -0500

    Move bli_kernel.h before bli_threading.h in order of inclusion in blis.h.

commit 4bcf1b35abea3f3dfc8f2fe462dcf155cf199e55
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 11 16:09:49 2016 -0500

    Fixed bli_get_range_*() bugs in trsm variants.
    
    Details:
    - Fixed incorrect calls to bli_get_range_*() from within trsm blocked
      variants 1f, 2b, and 2f. The bug somehow went undetected since the
      big commit (537a1f4), and, strangely, did not manifest via the BLIS
      testsuite. The bug finally came to our attention when running the
      libflame test suite while linking to BLIS. Thanks to Kiran Varaganti
      for submitting the initial report that led to this bug.

commit 9cfa33023f123a6c17e987f72fba174ce073f0b6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 11 16:02:30 2016 -0500

    Minor updates to bli_f2c.h.
    
    Details:
    - Added #undef guards to certain #define statements in bli_f2c.h,
      and renamed the file guard to BLIS_F2C_H. This helps when
      #including "blis.h" from an application or library that already
      #includes an "f2c.h" header.

commit a09a2e23eacf5328858c8318bb637c5ff3b71d08
Merge: 4dcd37eb 7c604e1c
Author: Tyler Michael Smith <tms@cs.utexas.edu>
Date:   Wed May 11 10:47:11 2016 -0500

    Merge pull request #76 from devinamatthews/move_simd_defs
    
    Move default SIMD-related definitions to bli_kernel_macro_defs.h

commit 4dcd37eb1b12a6e08cc13df7b61391ef8363f5d8
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Tue May 10 16:28:59 2016 -0500

    fixing knc simd align size

commit 619dee0daec3474b4e5a55df90a61aabcae194f2
Merge: b790b3d9 7c604e1c
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue May 10 12:13:24 2016 -0500

    Merge branch 'move_simd_defs' into knl

commit 7c604e1cbc1609b6e12d3ee973c08b7af5035be4
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue May 10 12:11:55 2016 -0500

    Move default SIMD-related definitions to bli_kernel_macro_defs.h. Otherwise, configurations which customize these fail as these are now defined in bli_kernel.h.

commit b790b3d9e1820f3b691676de48c291cae083452d
Merge: 4f8c05c9 a7be2d28
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue May 10 11:49:47 2016 -0500

    Merge branch 'master' into knl

commit a7be2d28e8930b154d0da1d6929b54a96e210af6
Merge: 97b512ef 4b1e55ed
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 10 11:48:51 2016 -0500

    Merge pull request #74 from devinamatthews/fix_common_symbols
    
    Default-initialize all extern global variables to avoid generating common symbols.

commit 4b1e55edbfe0e1cb2e7b9428424903497cb7a841
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue May 10 10:08:47 2016 -0500

    Default-initialize all extern global variables to avoid generating common symbols. Fixes #73.
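    
    A short sketch of the distinction, using a hypothetical variable name.
    A file-scope definition without an initializer is a tentative
    definition, which some toolchains (for example, GCC with -fcommon)
    emit as a common symbol. An explicit initializer makes it an ordinary
    definition:

```c
/* In a header: declaration only. */
extern int sketch_thread_count;

/* In exactly one source file: the "= 0" makes this a full definition.
   Written as "int sketch_thread_count;" it would be a tentative
   definition and could end up as a common symbol. */
int sketch_thread_count = 0;
```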

commit 97b512ef62c7e25c97ed5e9eca81cd7015b2ac91
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri May 6 10:24:30 2016 -0500

    Include headers from cblas.h to pull in f77_int.
    
    Details:
    - Added #include statements for certain key BLIS headers so that the
      definition of f77_int is pulled in when a user compiles application
      code with only #include "cblas.h" (and no other BLIS header). This
      is necessary since f77_int is now used within the cblas API.

commit c3a4d39d03665135f1616588b5ef7c3e9ef5688d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 4 17:22:56 2016 -0500

    Updates to haswell gemm micro-kernels.
    
    Details:
    - Added two new sets of [sd]gemm micro-kernels for haswell architectures,
      one that is 4x24/4x12 (s and d) and one that is 6x16/6x8.
    - Changed the haswell configuration to use the 6x16/6x8 micro-kernels
      by default.
    - Updated various Makefiles, in test, test/3m4m, and testsuite.

commit 0b01d355ae861754ae2da6c9a545474af010f02e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Apr 27 15:21:10 2016 -0500

    Miscellaneous cleanups, fixes to recent commits.
    
    Details:
    - Fixed a typo in bli_l1f_ref.h, introduced into bbb8569, that only
      manifested when non-reference level-1f kernels were used.
    - Added an #undef BLIS_SIMD_ALIGN_SIZE to bli_kernel.h of dunnington
      configuration to prevent a compile-time warning until I can figure out
      the proper permanent fix.
    - Moved frame/1f/kernels/bli_dotxaxpyf_ref_var1.c out of the compilation
      path (into 'other' directory). _ref_var2 is used by default, which is
      the variant that is built on axpyf and dotxf instead of dotaxpyv.
    - Removed section of frame/include/bli_config_macro_defs.h pertaining to
      mixed datatype support.

commit ed7326c836f427e2f8420b015220ce293207b10c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Apr 27 14:57:40 2016 -0500

    Added 'restrict' to l1v/l1f code in 'kernels' dir.
    
    Details:
    - Added 'restrict' keyword to existing kernel definitions in 'kernels'
      directory. These changes were meant for inclusion in bbb8569.

commit bbb8569b2a08c3bcd631d5a05eb389d01d94ac07
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Apr 27 14:13:46 2016 -0500

    Use 'restrict' in all kernel APIs; wspace changes.
    
    Details:
    - Updated level-1v, level-1f kernel function types (bli_l1?_ft.h) and
      generic kernel prototypes (bli_l1?_ker.h) to use 'restrict' for all
      numerical operand pointers (i.e., all pointers except the cntx_t).
    - Updated level-1f reference kernel definitions to use 'restrict' for
      all numerical operand pointers. (Level-1v reference kernel definitions
      were already updated in bdbda6e.)
    - Rewrote the level-1v and level-1f reference kernel prototypes in
      bli_l1v_ref.h and bli_l1f_ref.h, respectively, to simply #include
      bli_l1v_ker.h and bli_l1f_ker.h with redefined function base names
      (as was already being done for the level-3 micro-kernel prototypes
      in bli_l3_ref.h), rather than duplicate the signatures from the
      _ker.h files.
    - Added definitions to frame/include/bli_kernel_prototypes.h for axpbyv
      and xpbyv, which were probably meant for inclusion in bdbda6e.
    - Converted a number of instances of four spaces, as introduced in
      bdbda6e, to tabs.

commit 4ea419c72c789825e1f93a1eee88219bbf873930
Merge: f1e9be2a bdbda6e6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 26 12:50:45 2016 -0500

    Merge pull request #70 from devinamatthews/daxpby
    
    Give the level1v operations some love

commit bdbda6e6acc682ab1b6ca680edebd09ae12a832c
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Apr 25 11:05:57 2016 -0500

    Give the level1v operations some love:
    
    - Add missing axpby and xpby operations (plus test cases).
    - Add special case for scal2v with alpha=1.
    - Add restrict qualifiers.
    - Add special-case algorithms for incx=incy=1.

commit f1e9be2aba1a057eedb947bbae96848597777408
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 22 15:34:02 2016 -0500

    Minor tweak to test/Makefile.
    
    Details:
    - Just committing a minor change to test/Makefile that has been lingering
      in my local working copy for longer than I can remember.

commit aa0bceec277938328dabeb744680623f24fb0b61
Merge: 4136553f e2784b4c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 22 12:01:31 2016 -0500

    Merge branch 'master' of github.com:flame/blis

commit 4136553f0d0661a668dfdb9edcd7ce1c5773dde7
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 22 11:53:53 2016 -0500

    Clear level-3 cntx_t's via memset() before use.
    
    Details:
    - In all level-3 operations' _cntx_init() functions, replaced calls to
      bli_cntx_obj_init() with calls to bli_cntx_obj_clear(), and in all
      level-3 operations' _cntx_finalize() functions, removed calls to
      bli_cntx_obj_finalize(), leaving those function definitions empty.
    - Changed the definition of bli_cntx_obj_clear() so that the clearing
      occurs via a single call to memset().
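    
    The memset()-based clearing can be pictured with the sketch below. The
    struct is a hypothetical stand-in for cntx_t, whose real fields are
    not shown in this log:

```c
#include <string.h>

typedef struct {            /* hypothetical stand-in for cntx_t */
    int   method;
    void *blkszs[4];
    void *ukrs[4];
} cntx_sketch_t;

/* Set every byte of the context to zero in one call, rather than
   assigning each field individually. On mainstream platforms this
   also leaves the pointer fields null. */
void cntx_sketch_clear(cntx_sketch_t *cntx)
{
    memset(cntx, 0, sizeof(*cntx));
}
```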

commit 4f8c05c9e2ef4cbb82b35a3ebf1f0a0ac665830e
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu Apr 21 10:00:59 2016 -0500

    Rearrange KNL dgemm kernel again to streamline usage of ymm register. sgemm and dgemm now both working with Intel SDE.

commit e2784b4c921f706e756df3e146e20a4cb63f53e3
Merge: dd0ab1d9 a9b6c3ab
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Apr 20 18:34:09 2016 -0500

    Merge pull request #67 from devinamatthews/cblas-f77-int
    
    Change CBLAS integer type to f77_int

commit a9b6c3abda6222a8b240361643932e83cf726c4f
Merge: e4c54c81 dd0ab1d9
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Apr 20 16:00:10 2016 -0500

    Merge remote-tracking branch 'origin/master' into cblas-f77-int
    
    # Conflicts:
    #       config/haswell/bli_config.h

commit e4c54c81463c2a19c9bb6b1f0f1be3fa9d018a45
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Apr 20 15:56:46 2016 -0500

    Change integer type in CBLAS function signatures to f77_int, and add proper const-correctness to BLAS layer.

commit dd0ab1d93f33abca6af9edd7b8e52da62dcfa5b1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Apr 20 14:38:23 2016 -0500

    Converted some bli_cntx query functions to macros.
    
    Details:
    - Commented out several datatype-aware query functions (those ending in
      _dt) from bli_cntx.c, as well as their prototypes in bli_cntx.h, and
      added equivalent cpp query macros to bli_cntx.h.
    - Added 'bli_config.h' to .gitignore.

commit 7193230f7d35edbd1d2f77842a613971f1603463
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Apr 20 09:37:30 2016 -0500

    Work around missing VPMULLQ on KNL.

commit a30ccbc4c6a6e6460e78af6b5c530ee0d06f98fb
Merge: eb2f18e4 0e1a9821
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 19 15:04:33 2016 -0500

    Merge pull request #66 from devinamatthews/blas-configure
    
    Add configure options and generate bli_config.h automatically.

commit bd44cf13e886069bc66c10ac0db178be96629a0d
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Apr 19 13:43:04 2016 -0500

    Fix copy-paste errors in KNL kernels.

commit eb2f18e4844d985715df20798f50f9cc12e3b5ad
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 19 12:50:32 2016 -0500

    More compile-time fixes to bgq gemm ukernel code.

commit 0e1a9821d860f6c1d818baf4c48d21a23726c132
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Apr 19 11:44:37 2016 -0500

    Add configure options and generate bli_config.h automatically.
    
    Options to configure have been added for:
    - Setting the internal BLIS and BLAS/CBLAS integer sizes.
    - Enabling and disabling the BLAS and CBLAS layers.
    
    Additionally, configure options that require defining macros (the above plus the threading model) write their macros to the automatically-generated bli_config.h file in the top-level build directory. The old bli_config.h files in the config dirs were removed, and any kernel-related macros (SIMD size and alignment, etc.) were moved to bli_kernel.h. The Makefiles were also modified to find the new bli_config.h file.
    
    Lastly, support for OMP in clang has been added (closes #56).

commit a11eec05928ddc5c43fa5dbcd35f2edd24ff35a1
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Apr 18 13:13:36 2016 -0500

    Add sgemm ukernels for KNL. vpmullq is not implemented on KNL -- needs workaround.

commit ff84469a4575f1ef8a0010046fde52240a312cae
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 18 12:29:09 2016 -0500

    Applied various compilation fixes to bgq kernels.

commit c38e0dab05b2dc36672eab96e1248fb7fb2d785b
Merge: bd5e2296 cbcd0b73
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Apr 18 10:21:35 2016 -0500

    Merge remote-tracking branch 'origin/master' into knl

commit bd5e2296e98e042c31f1e8ece2c1ca8e4bdc2d4c
Merge: 4745def0 49f85177
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Apr 18 10:15:22 2016 -0500

    Merge remote-tracking branch 'origin/knl' into knl

commit 4745def0c87377ae83ad73ac514d7de08a96b2ac
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Apr 18 10:15:05 2016 -0500

    Add 64-bit offset vector so we can use vgatherqpd.

commit 49f85177f886f38889b60503a4e12fa7f04be1fd
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Apr 18 10:14:11 2016 -0500

    KNL ukernel compiles with gcc.

commit cbcd0b739dc54bd14fbb46aeda267c26725cd70f
Author: Tyler Michael Smith <tms@cs.utexas.edu>
Date:   Mon Apr 18 03:12:57 2016 -0500

    Changing ifdef for OSX pthread barriers

commit 58b2c3cf040134d1be913c585a3c6905629116c0
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Sat Apr 16 16:12:24 2016 -0500

    Rewrite of KNL kernel in GNU extended asm syntax.

commit dd62080cea78f3a23616200d6640e52c102b2bb9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 15 11:15:41 2016 -0500

    Compile-time fix to bgq l1f kernels.
    
    Details:
    - Fixed an old reference to bli_daxpyf_fusefac, which no longer exists,
      by replacing it with the axpyf fusing factor (8), and cleaned up the
      relevant section of config/bgq/bli_kernel.h.
    - Removed most of the details of the level-3 kernels from the template
      kernel code in config/template/kernels/3 and replaced it with a
      reference to the relevant kernel wiki maintained on the BLIS github
      website.

commit d5a915dd8d7a6ead42a68772e4420eb3647e6f1a
Merge: 4320b725 41694675
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 14 12:56:36 2016 -0500

    Merge branch 'master' of github.com:flame/blis

commit 4320b725a1f8fd34101470b6cf52ad504a79c517
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 14 12:51:29 2016 -0500

    Use kernel CFLAGS on "ukernels" directories.
    
    Details:
    - Updated the top-level Makefile so that the CFLAGS variable designated
      for kernel source code is applied not only to source code in
      directories named "kernels" but source code in any directory that
      contains the substring "kernels", such as "ukernels".
    - Formally disabled some code in gen-make-frag.sh script that was already
      effectively disabled. The code was related to handling "noopt" and
      "kernel" directories, which is now handled independently within the
      top-level Makefile without needing to place these source files into
      a separate makefile variable.

commit 41694675e4cb56e2e0323c7a7db48e0819606a31
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Apr 13 15:51:08 2016 -0500

    pthreads bugfixes
    
    Getting pthreads to work on my Mac
    Implemented a pthread barrier when _POSIX_BARRIER isn't defined
    Now spawn n-1 threads instead of n threads so that master thread isn't just spinning the whole time
    Add -lpthread instead of -pthread to LDFLAGS (for clang)

commit f756dbfa0d542cbc497724981520c83abf049c4b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Apr 13 11:25:33 2016 -0500

    Removed stale #include from bgq configuration.
    
    Details:
    - Removed an old #include statement ("bli_gemm_8x8.h") from the
      bli_kernel.h file in the bgq configuration. It turns out this
      file was no longer needed even prior to 537a1f4.

commit 0bd4169ea75f690714e7d2912229932a75d8a7e2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 11 18:08:32 2016 -0500

    Fixed context-broken dunnington/penryn kernels.
    
    Details:
    - Added missing context parameters to several instances where simpler
      kernels, or reference kernels, are called instead of executing the
      main body code contained in the kernel function in question.
    - Renamed axpyv and dotv kernel files to use "opt" instead of "int"
      substring, for consistency with level-1f kernels.

commit 7912af5db45b7372d19a9a3dfeb82df302a05628
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 11 17:32:13 2016 -0500

    CHANGELOG update (0.2.0)

commit 898614a555ea0aa7de4ca07bb3cb8f5708b6a002 (tag: 0.2.0)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 11 17:32:09 2016 -0500

    Version file update (0.2.0)

commit 537a1f4f85ce1aa008901857cb3182e6b4546d7f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 11 17:21:28 2016 -0500

    Implemented runtime contexts and reorganized code.
    
    Details:
    - Retrofitted a new data structure, known as a context, into virtually
      all internal APIs for computational operations in BLIS. The structure
      is now present within the type-aware APIs, as well as many supporting
      utility functions that require information stored in the context. User-
      level object APIs were unaffected and continue to be "context-free,"
      however, these APIs were duplicated/mirrored so that "context-aware"
      APIs now also exist, differentiated with an "_ex" suffix (for "expert").
      These new context-aware object APIs (along with the lower-level,
      type-aware, BLAS-like APIs) contain the address of a context as a last
      parameter, after all other operands. Contexts, or specifically, cntx_t
      object pointers, are passed all the way down the function stack into
      the kernels and allow the code at any level to query information about
      the runtime, such as kernel addresses and blocksizes, in a
      thread-friendly manner--that is, one that allows thread-safety, even if
      the original source of the information stored in the context changes at
      run-time (see next bullet for more on this "original source" of info).
      (Special thanks go to Lee Killough for suggesting the use of this kind
      of data structure in discussions that transpired during the early
      planning stages of BLIS, and also for suggesting such a perfectly
      appropriate name.)
    - Added a new API, in frame/base/bli_gks.c, to define a "global kernel
      structure" (gks). This data structure and API will allow the caller to
      initialize a context with the kernel addresses, blocksizes, and other
      information associated with the currently active kernel configuration.
      The currently active kernel configuration within the gks cannot be
      changed (for now), and is initialized with the traditional cpp macros
      that define kernel function names, blocksizes, and the like. However,
      in the future, the gks API will be expanded to allow runtime management
      of kernels and runtime parameters. The most obvious application of this
      new infrastructure is the runtime detection of hardware (and the
      implied selection of appropriate kernels). With contexts in place,
      kernels may even be "hot swapped" at runtime within the gks. Once
      execution enters a level-3 _front() function, the memory allocator will
      be reinitialized on-the-fly, if necessary, to accommodate the new
      kernels' blocksizes. If another application thread is executing with
      another (previously loaded) kernel, it will finish in a deterministic
      fashion because its kernel information was loaded into its context
      before computation began, and also because the blocks it checked out
      from the internal memory pools will be unaffected by the newer threads'
      reinitialization of the allocator.
    - Reorganized and streamlined the 'ind' directory, which contains much of
      the code enabling use of induced methods for complex domain matrix
      multiplication; deprecated bli_bsv_query.c and bli_ukr_query.c, as
      those APIs' functionality is now mostly subsumed within the global
      kernel structure.
    - Updated bli_pool.c to define a new function, bli_pool_reinit_if(),
      that will reinitialize a memory pool if the necessary pool block size
      has increased.
    - Updated bli_mem.c to use bli_pool_reinit_if() instead of
      bli_pool_reinit() in the definition of bli_mem_pool_init(), and placed
      usage of contexts where appropriate to communicate cache and register
      blocksizes to bli_mem_compute_pool_block_sizes().
    - Simplified control trees now that much of the information resides in
      the context and/or the global kernel structure:
      - Removed blocksize object pointer (blksz_t*) fields from all control
        tree node definitions and replaced them with blocksize id (bszid_t)
        values instead, which may be passed into a context query routine in
        order to extract the corresponding blocksize from the given context.
      - Removed micro-kernel function pointer (func_t*) fields from all
        control tree node definitions. Now, any code that needs these function
        pointers can query them from the local context, as identified by a
        level-3 micro-kernel id (l3ukr_t), level-1f kernel id (l1fkr_t), or
        level-1v kernel id (l1vkr_t).
      - Removed blksz_t object creation and initialization, as well as kernel
        function object creation and initialization, from all operation-
        specific control tree initialization files (bli_*_cntl.c), since this
        information will now live in the gks and, secondarily, in the context.
    - Removed blocksize multiples from blksz_t objects. Now, we track
      blocksize multiples for each blocksize id (bszid_t) in the context
      object.
    - Removed the bool_t's that were required when a func_t was initialized.
      These bools are meant to allow one to track the micro-kernel's storage
      preferences (by rows or columns). This preference is now tracked
      separately within the gks and contexts.
    - Merged and reorganized many separate-but-related functions into single
      files. This reorganization affects frame/0, 1, 1d, 1m, 1f, 2, 3, and
      util directories, but has the most obvious effect of allowing BLIS
      to compile noticeably faster.
    - Reorganized execution paths for level-1v, -1d, -1m, and -2 operations
      in an attempt to reduce overhead for memory-bound operations. This
      includes removal of default use of object-based variants for level-2
      operations. Now, by default, level-2 operations will directly call a
      low-level (non-object based) loop over a level-1v or -1f kernel.
    - Converted many common query functions in bli_blksz.c (renamed from
      bli_blocksize.c) and bli_func.c into cpp macros, now defined in their
      respective header files.
    - Defined bli_mbool.c API to create and query "multi-bools", or
      heterogeneous bool_t's (one for each floating-point datatype), in the
      same spirit as blksz_t and func_t.
    - Introduced two key parameters of the hardware: BLIS_SIMD_NUM_REGISTERS
      and BLIS_SIMD_SIZE. These values are needed in order to compute a third
      new parameter, which may be set indirectly via the aforementioned
      macros or directly: BLIS_STACK_BUF_MAX_SIZE. This value is used to
      statically allocate memory in macro-kernels and the induced methods'
      virtual kernels to be used as temporary space to hold a single
      micro-tile. These values are now output by the testsuite. The default
      value of BLIS_STACK_BUF_MAX_SIZE is computed as
      "2 * BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE".
    - Cleaned up top-level 'kernels' directory (for example, renaming the
      embarrassingly misleading "avx" and "avx2" directories to "sandybridge"
      and "haswell," respectively) and gave more consistent and meaningful
      names to many kernel files (as well as updating their interfaces to
      conform to the new context-aware kernel APIs).
    - Updated the testsuite to query blocksizes from a locally-initialized
      context for test modules that need those values: axpyf, dotxf,
      dotxaxpyf, gemm_ukr, gemmtrsm_ukr, and trsm_ukr.
    - Reformatted many function signatures into a standard format that will
      more easily facilitate future API-wide changes.
    - Updated many "mxn" level-0 macros (i.e., those used to inline double
      loops for level-1m-like operations on small matrices) in
      frame/include/level0 to use more obscure local variable names in an
      effort to avoid variable shadowing. (Thanks to Devin Matthews for
      pointing out these gcc warnings, which are only output using -Wshadow.)
    - Added a conj argument to setm, so that its interface now mirrors that
      of scalm. The semantic meaning of the conj argument is to optionally
      allow implicit conjugation of the scalar prior to being populated into
      the object.
    - Deprecated all type-aware mixed domain and mixed precision APIs. Note
      that this does not preclude supporting mixed types via the object APIs,
      where it produces absolutely zero API code bloat.
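    
    A minimal sketch of the context idea described in the first two bullets
    follows. It is an illustration only, not the BLIS implementation: every
    name carrying an "sk_" or "SK_" prefix is hypothetical, and the blocksize
    values are arbitrary. The point it shows is that a context is a snapshot
    taken from the global kernel structure, so a later change to the gks does
    not affect a computation that already holds its own context. A real
    multithreaded version would also need to synchronize the copy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical blocksize ids; they mirror the idea of bszid_t only. */
typedef enum { SK_MR, SK_NR, SK_MC, SK_KC, SK_NC, SK_NUM_BLKSZS } sk_bszid_t;

/* Hypothetical micro-kernel function pointer type. */
typedef void (*sk_ukr_fp)(void);

/* Hypothetical context: blocksizes indexed by id plus a kernel address. */
typedef struct
{
    size_t    blksz[SK_NUM_BLKSZS];
    sk_ukr_fp gemm_ukr;
} sk_cntx_t;

/* Stand-in for the "global kernel structure", the original source of the
   information. The values here are arbitrary placeholders. */
static sk_cntx_t sk_gks = { { 6, 8, 72, 256, 4080 }, NULL };

/* Initialize a local context by copying the current state of the gks. */
static void sk_gks_cntx_init(sk_cntx_t *cntx)
{
    assert(cntx != NULL);
    *cntx = sk_gks;
}

/* Query a blocksize from a context by its id. */
static size_t sk_cntx_get_blksz(sk_bszid_t id, const sk_cntx_t *cntx)
{
    assert(cntx != NULL && id < SK_NUM_BLKSZS);
    return cntx->blksz[id];
}
```

    In this sketch, a caller that initialized its context before the gks
    changed keeps seeing the blocksizes it started with, which is the
    property the commit message relies on for deterministic completion.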

commit dd856c2cb75a2221a503a73dde27790c34b91570
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Apr 11 10:39:18 2016 -0500

    Translated MIC kernel to KNL and cleaned up a bit. Only real change is lack of swizzle modifiers for FMA instructions (used bcast from memory instead).

commit 7f27431d3fffdda99c282ec412731d0a90cb32a7
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Apr 8 10:04:39 2016 -0500

    Copy mic kernel to knl for transliteration.

commit f8f02f0334ac020021e15a415bcd33aeea01deb4
Merge: 32c92d94 d1f8e5d9
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Apr 6 11:37:05 2016 -0500

    Merge branch 'master' into const_correctness

commit 32c92d945c55708da0eb63be1771f8c5430e3910
Merge: 62914ccb 20af937b
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Apr 6 11:36:02 2016 -0500

    Merge branch 'master' into const_correctness

commit d1f8e5d9b2ecd054ed103f4d642d748db2d4f173
Merge: 20af937b c11d28ee
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 5 12:21:27 2016 -0500

    Merge pull request #60 from esauvage/master
    
    sgemm µkernel for bulldozer : bug correction for k%4 != 0

commit c11d28eed89d65494bc4019f04d046520866c0ff
Author: Etienne Sauvage <etienne.sauvage@gmail.com>
Date:   Sat Apr 2 21:15:48 2016 +0200

    cgemm µkernel for bulldozer : bug correction for k%4 != 0

commit 20af937b57f82bb3acb09418d5c0206e1b24f2c7
Merge: 36c3abb0 fc61a114
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Mar 31 14:37:30 2016 -0500

    Merge pull request #59 from devinamatthews/fix_testsuite_makefile
    
    Fix testsuite makefile

commit fc61a1143edeba4946d4b9915f1775bb08e643fc
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu Mar 31 10:53:01 2016 -0500

    Fix formatting in configure.

commit 26379b14de630e3a6c6eef5dfe87ff001558a8a6
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu Mar 31 10:45:48 2016 -0500

    Adjust paths in common.mk to support building from testsuite dir.

commit 36c3abb05fecb02d4a9ab13b2b69d133adf34583
Merge: 64b41fa5 917ce754
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Mar 31 10:26:17 2016 -0500

    Merge pull request #58 from esauvage/master
    
    cgemm & zgemm micro-kernels for FMA4 instruction set (bulldozer confi…

commit 356d854fc9e34642cc46e0e02a8ceb56114878af
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Mar 30 16:33:15 2016 -0500

    Make symlink to common.mk in build directory.

commit edbb8470044f82ef959583ee09613a5a985292b5
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Mar 30 16:27:11 2016 -0500

    Refactor out some definitions which moved from make_defs.mk to Makefile for use in testsuite Makefile.

commit 917ce75482a543fef46553efff6c246939761e59
Author: Etienne Sauvage <etienne.sauvage@gmail.com>
Date:   Wed Mar 30 22:03:09 2016 +0200

    cgemm & zgemm micro-kernels for FMA4 instruction set (bulldozer configuration), based on x86_64/avx micro-kernel

commit 62914ccbcdb3c594f065dcfa65bd7e7b95c79283
Merge: bbf704bf 64b41fa5
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Mar 29 15:24:25 2016 -0500

    Merge branch 'master' into const_correctness

commit 64b41fa554dff44b2f9ad48901b67c63836407a8
Merge: 1b09e343 0171ad58
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 29 15:19:41 2016 -0500

    Merge pull request #54 from devinamatthews/more_config_opts
    
    More config opts

commit 1b09e343dfe5b48b4842e2cb96f41c8cc249bad0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 29 12:55:28 2016 -0500

    Updated gcc version from 4.8 to 4.9 in .travis.yml.

commit 0171ad58997b3a5a9b76301511dbe0751fffc940
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Mar 28 13:55:06 2016 -0500

    Add icc and clang support for Intel architectures, fixes #47. 2bd036f fixes #49 BTW.

commit 3090fff64cc87ff2519a09f38e6b8699cf3cba11
Merge: 8624e365 4ca5d5b1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Mar 28 12:36:25 2016 -0500

    Merge pull request #44 from esauvage/master
    
    sgemm micro-kernel for FMA4 instruction set

commit e6e566426ac3ded7ef87cd8ff9be98accfdc4acc
Merge: 469429ec 8624e365
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Sat Mar 26 14:10:15 2016 -0500

    Merge branch 'master' into more_config_opts

commit 8624e36543160739d954c4dbcc5a5594458f3a12
Merge: a315833f 2bd036f1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Mar 26 13:56:28 2016 -0500

    Merge pull request #50 from devinamatthews/fix_noopt_avx
    
    Fix configuration issue where instruction set flags are not specified for debug builds.

commit 469429ec34e5b1a172ce35596f9c7afdaacac131
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Mar 25 20:45:41 2016 -0500

    Fix LD_FLAGS -> LDFLAGS.

commit 8442d65c9ead0376fc5f2dfad62fd4862ab9b2b3
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Mar 25 20:06:48 2016 -0500

    Replace -march=native with specific architecture flags to support cross-compiling, and add icc support for Intel architectures.

commit 76099f20be1b49ac960f7e3c5a8296bbf4e1782d
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Mar 25 17:22:58 2016 -0500

    Add threading option to configure.

commit ad43eab4c7899d56d8d7caa6e2d92bc0581ea5a5
Merge: 9452bdb3 2bd036f1
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Mar 25 15:00:02 2016 -0500

    Merge branch 'fix_noopt_avx' into more_config_opts

commit 9452bdb3afbf2d7f898134a091d7790817e7be9c
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Mar 25 14:59:50 2016 -0500

    Add options for verbose make output and static/shared linking to configure.

commit 2bd036f1f9ce1ee0864365557f66d9415dd42de3
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Mar 25 12:16:49 2016 -0500

    Fix configuration issue where instruction set flags are not specified for debug builds.

commit bbf704bf7501411964a63a68f1af541f612cf92d
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Mar 25 09:55:35 2016 -0500

    Add missing const to bli_read_nway_from_env.

commit a315833f067944fb0bc14cf60f0c7dcb5dc897b6
Merge: 1d1a426d af92773f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Mar 24 12:30:21 2016 -0500

    Merge pull request #48 from figual/master
    
    Updated and improved ARMv8 micro-kernels.

commit af92773f4f85a2441fe0c6e3a52c31b07253d08e
Author: figual <figual@ucm.es>
Date:   Wed Mar 23 22:07:02 2016 +0100

    Updated and improved ARMv8 micro-kernels.

commit a4d7729776d17d9bdf2341eacd70b9770b9ba8d2
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Mar 21 09:55:21 2016 -0500

    Set default value for debug_type variable.

commit 0e2447fa55d8c5fa2b1fc4150073512495c5f9eb
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu Mar 17 16:32:05 2016 -0500

    Add const correctness to auxinfo_t struct (microkernels need update theoretically).

commit 1d1a426d18ec03754021456862a1f4d1dfec1fbf
Merge: 5a978fff d226dfa0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Mar 7 15:17:53 2016 -0600

    Merge pull request #46 from devinamatthews/new-config-opts
    
    Add several changes to the build system.

commit d226dfa05190eb477b33563b1edccf8603973336
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Sat Mar 5 16:18:14 2016 -0600

    Add several changes to the build system.
    
    1) Add -- options.
    2) Add -d/--enable-debug option to enable debugging symbols with and without optimization.
    3) Allow user to specify CC at configure time, and determine vendor (gcc/icc/etc.). For now configurations enforce a particular vendor.
    4) Add make V=[0,1] option to control build verbosity.

commit 5a978fffdb8f09a81c89541d541d4a6830cd70a4
Merge: adb2b4e0 63e26423
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 4 17:26:58 2016 -0600

    Merge pull request #45 from devinamatthews/high_prec_timers
    
    Use clock_gettime(CLOCK_MONOTONIC) and mach_absolute_time instead of gettimeofday

commit 63e264239053b913164a849dd8a45829087eaddc
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Mar 4 13:17:50 2016 -0600

    Make sure that -lrt is linked on Linux.

commit 44fddd48dc1708a956803d1948f04429ec0d8700
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Mar 4 12:36:38 2016 -0600

    Add missing \.

commit 7cabd2131f953de23e7015d760b0ddfda51b1251
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu Mar 3 11:43:07 2016 -0600

    Use clock_gettime(CLOCK_MONOTONIC) and mach_absolute_time instead of gettimeofday.
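    
    A rough sketch of a monotonic timer built on the two calls named above.
    The helper name sketch_clock() is hypothetical and this is not the code
    from the commit; it only shows how the two platform APIs can sit behind
    one function that returns seconds. On older glibc versions,
    clock_gettime() may additionally require linking with -lrt.

```c
#if !defined(__APPLE__) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 199309L /* expose clock_gettime() in strict modes */
#endif

#include <assert.h>
#if defined(__APPLE__)
#include <mach/mach_time.h>
#else
#include <time.h>
#endif

/* Hypothetical helper: seconds elapsed on a monotonic clock. Unlike
   gettimeofday(), the result is not affected by wall-clock adjustments. */
static double sketch_clock(void)
{
#if defined(__APPLE__)
    mach_timebase_info_data_t tb;
    mach_timebase_info(&tb);
    /* ticks * numer / denom yields nanoseconds. */
    return (double)mach_absolute_time() * (double)tb.numer
           / (double)tb.denom * 1.0e-9;
#else
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1.0e-9;
#endif
}
```

    Timing a region then amounts to taking the difference of two
    sketch_clock() readings.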

commit adb2b4e096c78e8b2f85fd372cf0d5eb04af5be8
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Mar 2 14:48:12 2016 -0600

    Fixing guard for non-implemented partitioning through packed matrices

commit 4ca5d5b1fd6f2e4a8b2e139c5405475239581e51
Author: Etienne Sauvage <etienne.sauvage@gmail.com>
Date:   Tue Mar 1 21:33:01 2016 +0100

    sgemm micro-kernel for FMA4 instruction set (bulldozer configuration), based on x86_64/avx micro-kernel

commit 627d59b5ba06866b26f46e4434a0435b600925e3
Author: Etienne Sauvage <etienne.sauvage@gmail.com>
Date:   Mon Feb 29 21:53:12 2016 +0100

    symbolic link for bulldozer configuration to kernels

commit 2dc5c0ae038ed175fab85751803ada05734d1ba1
Merge: f2809fc5 3d0fae81
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 29 12:22:51 2016 -0600

    Merge pull request #40 from tkelman/bulldozer-symlink
    
    Add symlink from config/bulldozer/kernels to kernels/x86_64/bulldozer

commit f2809fc5f74466c755da6a5b4632853e634060b5
Merge: f86b94f2 8624a33c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Feb 27 13:06:03 2016 -0600

    Merge pull request #39 from devinamatthews/fix_f2c_conflicts
    
    Devin's f2c type namespace update.
    
    Details:
    - Added "bla_" prefix to f2c type names to prevent conflicts with external user code.
    - Removed most of the body of bli_f2c.h, which was unused.

commit 3d0fae810d942085d8f2d389820b4e0027577db8
Author: Tony Kelman <tony@kelman.net>
Date:   Thu Feb 25 23:24:03 2016 -0800

    Add symlink from config/bulldozer/kernels to kernels/x86_64/bulldozer
    
    to fix linking issue mentioned in #37 and https://groups.google.com/forum/#!topic/blis-devel/iypwljcaeEI

commit 8624a33ccc12dff6f6c4f92992ca5636af1576a6
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu Feb 25 13:51:26 2016 -0600

    Fix remaining f2c conflicts.

commit 372eef0b6c0a535bf88d4b46b72f61266e8491ba
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu Feb 25 12:01:58 2016 -0600

    Fixed most conflicts after hack-n-slash of bli_f2c.h; cleanup in
    progress.

commit f86b94f206e2e09fa3221cc55c3dc5b05ca4775a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 23 18:12:34 2016 -0600

    Included missing blas2blis integer def to CBLAS.
    
    Details:
    - Added #include "bli_config_macro_defs.h" to all cblas_*.c files in
      compat/cblas/src. This has the effect of defining
      BLIS_BLAS2BLIS_INT_TYPE_SIZE to the default value if bli_config.h does
      not define it. Thanks to Tony Kelman for reporting this bug.
    - In cblas_i?amax.c, changed the type of the variable 'iamax' from 'int'
      to 'f77_int'. This eliminates a compiler warning and a potential
      runtime bug and/or crash when the size of an int differs from the size
      of f77_int (as determined by BLIS_BLAS2BLIS_INT_TYPE_SIZE).

commit 0b126de1342c11c65623bcb38e258e21e9244e3d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Nov 13 16:29:12 2015 -0600

    Consolidated packm_blk_var1 and packm_blk_var2.
    
    Details:
    - Consolidated the two blocked variants for packm into a single
      implementation (packm_blk_var1) and removed the other variant.
    - Updated all induced method _cntl_init() functions in frame/cntl/ind/
      to use the new blocked variant 1.
    - Defined two new macros, bli_is_ind_packed() and bli_is_nat_packed(),
      to detect pack_t schemas for induced methods and native execution,
      respectively.

commit 30e5eb29e060b97752f702d2ea5d101d950f53b2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Nov 13 12:14:19 2015 -0600

    Minor changes to treatment of rs, cs in bli_obj.c.
    
    Details:
    - Applied a patch submitted by Devin Matthews that:
      - implements subtle changes to handling of somewhat unusual cases of
        row and column strides to accommodate certain tensor cases, which
        includes adding dimension parameters to _is_col_tilted() and
        _is_row_tilted() macros,
      - simplifies how buffers are sized when requesting BLIS-allocated
        objects,
      - re-consolidates bli_adjust_strides_*() into one function, and
      - defines 'restrict' keyword as a "nothing" macro for C++ and pre-C99
        environments.
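    
    For the last item, a "nothing" macro for 'restrict' can be sketched as
    follows. The exact preprocessor test used in the patch is not shown in
    this log, so the condition below is an assumption, and sketch_axpy() is
    a hypothetical function that exists only to show the keyword in use.

```c
#include <assert.h>
#include <stddef.h>

/* 'restrict' is a keyword only in C99 and later. For C++ and pre-C99
   compilers, define it to nothing so the same source still compiles. */
#if defined(__cplusplus) || !defined(__STDC_VERSION__) || \
    (__STDC_VERSION__ < 199901L)
#define restrict
#endif

/* Hypothetical example: y := y + alpha * x, with non-aliasing operands. */
static void sketch_axpy(size_t n, double alpha,
                        const double *restrict x, double *restrict y)
{
    size_t i;
    assert(x != NULL && y != NULL);
    for (i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}
```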

commit f0a4f41b5acf55b41707ec821c4c5f9076dfbc24
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Nov 12 15:22:50 2015 -0600

    Fixed unimplemented case in core2 sgemm ukernel.
    
    Details:
    - Implemented the "beta == 0" case for general stride output for the
      dunnington sgemm micro-kernel. This case had been, up until now,
      identical to the "beta != 0" case, which does not work when the
      output matrix has nan's and inf's. It had manifested as nan residuals
      in the test suite for right-side tests of ctrsm4m1a. Thanks to Devin
      Matthews for reporting this bug.
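    
    The reason the two cases must differ: under IEEE arithmetic, 0 * nan is
    nan, so computing beta * C with beta == 0 does not clear nan or inf
    entries already present in C. The sketch below is not the dunnington
    micro-kernel; sketch_update_c() is a hypothetical general-stride update
    that shows the overwrite-versus-accumulate distinction.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical update of an m x n tile of C with general strides:
   C := beta * C + AB, where ab holds the product in column-major order
   with leading dimension m. */
static void sketch_update_c(size_t m, size_t n, const float *ab,
                            float beta, float *c, size_t rs_c, size_t cs_c)
{
    size_t i, j;
    assert(ab != NULL && c != NULL);

    if (beta == 0.0f)
    {
        /* Overwrite: C is never read, so nan/inf in C cannot propagate. */
        for (j = 0; j < n; ++j)
            for (i = 0; i < m; ++i)
                c[i * rs_c + j * cs_c] = ab[i + j * m];
    }
    else
    {
        /* Accumulate: C is scaled by beta and then updated. */
        for (j = 0; j < n; ++j)
            for (i = 0; i < m; ++i)
                c[i * rs_c + j * cs_c] =
                    beta * c[i * rs_c + j * cs_c] + ab[i + j * m];
    }
}
```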

commit 42810bbfa0b8f006ecc5128d903909ec13ea63f9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Nov 12 12:07:46 2015 -0600

    Fixed minor bugs for uncommon obj_create cases.
    
    Details:
    - Separated bli_adjust_strides() into _alloc() and _attach() flavors so
      that the latter can avoid a test performed by the former, in which the
      rs and cs are overridden and set to zero if either matrix dimension is
      zero. Actually, we also disable this overriding behavior, even for the
      _alloc() case, since keeping the original strides (probably) does not
      hurt anything. The original code has been kept commented-out, though,
      in case an unintended consequence is later discovered.
    - Fixed a typo in an error check for general stride cases where rs == cs.

commit 3e6dd11467643fbc2cb45c13cec8dd6024232833
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Nov 3 10:30:08 2015 -0600

    Minor re-expression in quadratic partitioning code.
    
    Details:
    - Minor change to quadratic equation solution code that avoids
      recomputation of the sqrt() parameter when the compiler is not
      smart enough to perform this optimization automatically.

commit 0694b722f7e4df00efb32639095a2aca80e67f52
Merge: 3e116f0a 33557ecc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 2 17:24:25 2015 -0600

    Merge branch 'master' of github.com:flame/blis

commit 3e116f0a2953f50b3c068759a775ad7ffae04e49
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 2 17:18:23 2015 -0600

    Fixed imaginary bug in quadratic partitioning code.
    
    Details:
    - Fixed a bug in the relatively new quadratic partitioning code that,
      under the right conditions, would perform sqrt() on a negative value.
      If the solution is imaginary, we discard it and use an alternate
      partition width that assumes no diagonal intersection. That alternate
      width is actually already computed, so, the fix was quite simple.
      Thanks to Devangi Parikh for reporting this bug.

commit 33557ecccaf49b2569b7f3d7bcea52c2aab94c68
Author: Jeff Hammond <jeff.science@gmail.com>
Date:   Mon Nov 2 12:18:43 2015 -0800

    add Travis CI build status icon to the README

commit 4a502fbe77bd0f701108baaa559d9cfb483f88de
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 2 13:28:34 2015 -0600

    Laid groundwork for runtime memory pool resizing.
    
    Details:
    - Changed bli_pool_finalize() so that the freeing begins with the block
      at top_index instead of block 0. This allows us to use the function
      for terminal finalization as well as temporary cleanup prior to
      reinitialization. Also, clear the pool_t struct upon _pool_finalize()
      in case it is called in the terminal case with some blocks still
      checked out to threads (in which case the threads will see the new
      block size as 0 and thus release the block as intended).
    - Added bli_pool_reinit(), which calls _pool_finalize() followed by
      _pool_init() with new parameters.
    - Added bli_mem_reinit(), which is based on bli_pool_reinit().
    - Added new wrapper, _mem_compute_pool_block_sizes(), which calls
      _mem_compute_pool_block_sizes_dt().
    - Updated bli_mem_release() so that the pblk_t is freed, via
      _pool_free_block(), if the block size recorded in the mem_t at the
      time the pblk_t was acquired is now different from the value in the
      pool_t.

commit 37e55ca39bdbddaec03ad30d43e8ad2b3e549c96
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 30 18:25:04 2015 -0500

    Fixed obscure 3m1/4m1a bugs in trmm[3] and trsm.
    
    Details:
    - Fixed a family of bugs in the triangular level-3 operations for
      certain complex implementations (3m1 and 4m1a) that only manifest if
      one of the register blocksizes (PACKMR/PACKNR, actually) is odd:
      - Fixed incorrect imaginary stride computation in bli_packm_blk_var2()
        for the triangular case.
      - Fixed the incorrect computation of imaginary stride, as stored in
        the auxinfo_t struct in trmm and trsm macro-kernels.
      - Fixed incorrect pointer arithmetic in the trsm macro-kernels in the
        cases where the register blocksize for the triangular matrix is
        odd. Introduced a new byte-granular pointer arithmetic macro,
        bli_ptr_add(), that computes the correct value.
    - Added cpp macro to bli_macro_defs.h for typeof() operator, defined in
      terms of __typeof__, which is used by bli_ptr_add() macro.
    - Disabled the row- vs. column-storage optimization in bli_trmm_front()
      for singleton problems because the inherent ambiguity of whether a
      scalar is row-stored or column-stored causes the wrong parameter
      combination code to be executed (by dumb luck of our checking for
      row storage first).
    - Added commented-out debugging lines to 3m1/4m1a and reference
      micro-kernels, and trsm_ll macro-kernel.
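    
    A sketch of what a byte-granular pointer arithmetic macro can look like.
    This is not the definition of bli_ptr_add(); the names sketch_typeof(),
    sketch_ptr_add(), sketch_dcomplex, and sketch_next_panel() are
    hypothetical, and the macro assumes a compiler that provides
    __typeof__. It shows one way an odd offset goes wrong: an offset of 3
    real elements cannot be expressed as a whole number of complex
    elements, so ordinary typed pointer arithmetic would truncate it.

```c
#include <assert.h>
#include <stddef.h>

/* typeof() expressed in terms of the GNU-style __typeof__ operator. */
#define sketch_typeof(x) __typeof__(x)

/* Advance ptr by nbytes bytes while keeping the pointer's original type. */
#define sketch_ptr_add(ptr, nbytes) \
    ((sketch_typeof(ptr))((char *)(ptr) + (nbytes)))

/* Hypothetical double-precision complex element. */
typedef struct { double real; double imag; } sketch_dcomplex;

/* Hypothetical helper: step a complex-typed pointer forward by an offset
   measured in real (double) elements, which may be odd. */
static sketch_dcomplex *sketch_next_panel(sketch_dcomplex *p,
                                          size_t n_real_elems)
{
    assert(p != NULL);
    return sketch_ptr_add(p, n_real_elems * sizeof(double));
}
```

    With an offset of 3 real elements, the byte-granular form advances by
    3 * sizeof(double) bytes, whereas p + 3 / 2 would advance by only one
    complex element.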

commit 46294d80e5a79c598e200e1c8ec2a642ff839971
Merge: d3159c57 a0a7b85a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Oct 27 12:41:23 2015 -0500

    Merge pull request #35 from figual/master
    
    Fixed incomplete code in the double precision ARMv8 microkernel.

commit a0a7b85ac3e157af53cff8db0e008f4a3f90372c
Author: Francisco Igual <figual@ucm.es>
Date:   Tue Oct 27 08:59:15 2015 +0000

    Fixed incomplete code in the double precision ARMv8 microkernel.

commit d3159c5740c9ee7f8c0b661003aab6f00646ad6f
Merge: b489152e 7e03e45b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 21 14:54:00 2015 -0500

    Merge branch 'master' of github.com:flame/blis

commit b489152e112644ec3b6d19e687231a9607f7694f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 21 14:53:17 2015 -0500

    Use vzeroall in haswell micro-kernels.

commit 7e03e45bfe6c27c4fdbf06b1caa7f49e9a5fef49
Merge: 77ddb0b1 4f88c29f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 14 13:26:07 2015 -0500

    Merge pull request #33 from xianyi/master
    
    Enable Travis CI

commit 4f88c29f9e634cbb6fb22d8c88931f0ec78ad7db
Author: Zhang Xianyi <traits.zhang@gmail.com>
Date:   Wed Oct 14 12:57:50 2015 -0500

    Detect Intel Broadwell (using Haswell config).

commit 4b0ac1a9984a93f7ad4369b10fca63991107d9f5
Merge: fe3e355c 77ddb0b1
Author: Zhang Xianyi <traits.zhang@gmail.com>
Date:   Wed Oct 14 12:51:05 2015 -0500

    Merge branch 'upstream_master'

commit 77ddb0b1d31ada111dadf392766ba6d9210ed9fb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Oct 13 12:53:06 2015 -0500

    Removed flop-counting mechanism.
    
    Details:
    - Removed the optional flop-counting feature introduced in commit
      7574c994.

commit 276da366187460a4c8e6e0910e79cb39ce780bfe
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 12 11:43:03 2015 -0500

    Minor formatting change to README.md.

commit d17057446f5404824478e8a6cd08f242ab75544a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 12 11:39:49 2015 -0500

    Added "Getting Started" section to README.md.
    
    Details:
    - Added section to README.md file containing links to wikis with brief
      descriptions.

commit e7e1f2f7b601b21b50e3cdad8972cb3fe11018d3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 2 16:51:52 2015 -0500

    Minor updates to CREDITS, README files.

commit 55329906ecd7ce1ab910e4d30a29354a9172e7ea
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Sep 26 20:47:19 2015 -0500

    Minor edits to README.md, testsuite.
    
    Details:
    - Fixed typos in README.md.
    - Fixed column heading alignment for testsuite when matlab output is
      enabled.
    - Minor updates to test/3m4m/runme.sh and test/3m4m/Makefile.

commit bbebdb5793a8fd6aaf257012ab0272beaa04a0de
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Sep 25 14:47:27 2015 -0500

    Replaced README with README.md.
    
    Details:
    - Replaced the old (and short) README file with a much more comprehensive
      version written in github-flavored markdown. The new file is based on
      content taken from the old Google Code homepage.

commit e2e9d64a63485461192d9c2a6dd0183a8b71013c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Sep 24 12:14:03 2015 -0500

    Load balance thread ranges for arbitrary diagonals.
    
    Details:
    - Expanded/updated interface for bli_get_range_weighted() and
      bli_get_range() so that the direction of movement is specified in the
      function name (e.g. bli_get_range_l2r(), bli_get_range_weighted_t2b())
      and also so that the object being partitioned is passed instead of an
      uplo parameter. Updated invocations in level-3 blocked variants, as
      appropriate.
    - (Re)implemented bli_get_range_*() and bli_get_range_weighted_*() to
      carefully take into account the location of the diagonal when computing
      ranges so that the area of each subpartition (which, in all present
      level-3 operations, is proportional to the amount of computation
      engendered) is as equal as possible.
    - Added calls to a new class of routines to all non-gemm level-3 blocked
      variants:
        bli_<oper>_prune_unref_mparts_[mnk]()
      where <oper> is herk, trmm, or trsm and [mnk] is chosen based on which
      dimension is being partitioned. These routines call a more basic
      routine, bli_prune_unref_mparts(), to prune unreferenced/unstored
      regions from matrices and simultaneously adjust other matrices which
      share the same dimension accordingly.
    - Simplified herk_blk_var2f, trmm_blk_var1f/b as a result of the
      new pruning routines.
    - Fixed incorrect blocking factors passed into bli_get_range_*() in
      bli_trsm_blk_var[12][fb].c
    - Added a new test driver in test/thread_ranges that can exercise the new
      bli_get_range_*() and bli_get_range_weighted_*() under a range of
      conditions.
    - Reimplemented m and n fields of obj_t as elements in a "dim"
      array field so that dimensions could be queried via index constant
      (e.g. BLIS_M, BLIS_N). Adjusted/added query and modification
      macros accordingly.
    - Defined mdim_t type to enumerate BLIS_M and BLIS_N indexing values.
    - Added bli_round() macro, which calls C math library function round(),
      and bli_round_to_mult(), which rounds a value to the nearest multiple
      of some other value.
    - Added miscellaneous pruning- and mdim_t-related macros.
    - Renamed bli_obj_row_offset(), bli_obj_col_offset() macros to
      bli_obj_row_off(), bli_obj_col_off().
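
    Sketch: a minimal illustration of area-balanced partitioning, assuming
    an n-by-n lower-triangular matrix whose diagonal starts at the top-left
    corner and whose columns are split left to right. The helper
    tri_range_l2r() is hypothetical. Unlike bli_get_range_weighted_*(), it
    ignores blocking factors and arbitrary diagonal offsets.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: give thread t (of nt) a column range [start, end)
   of an n-by-n lower-triangular matrix so that every range covers roughly
   the same number of stored elements. Column j stores n - j elements. */
static void tri_range_l2r(size_t n, size_t nt, size_t t,
                          size_t *start, size_t *end)
{
    assert(nt > 0 && t < nt);

    size_t total     = n * (n + 1) / 2;        /* stored elements overall */
    size_t lo_target = total * t / nt;         /* area left of this range */
    size_t hi_target = total * (t + 1) / nt;   /* area left of its end    */

    size_t area = 0;
    size_t j    = 0;

    /* Skip columns until the area to the left reaches the lower target. */
    while (j < n && area < lo_target) { area += n - j; j++; }
    *start = j;

    /* Keep taking columns until the upper target is reached. */
    while (j < n && area < hi_target) { area += n - j; j++; }
    *end = j;
}
```

    For n = 8 and two threads, this yields the ranges [0,3) and [3,8), with
    areas 21 and 15. An even split into [0,4) and [4,8) would give areas 26
    and 10.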

commit fe3e355c9c5a6f65b8736b009e2d501b62a83ea1
Merge: efa641e3 4dd9dd3e
Author: Zhang Xianyi <traits.zhang@gmail.com>
Date:   Fri Aug 21 14:38:36 2015 -0500

    Merge branch 'upstream_master'

commit efa641e36b73abee34166a252e90e28a6281d92d
Author: Zhang Xianyi <traits.zhang@gmail.com>
Date:   Sat Aug 22 03:15:50 2015 +0800

    Try to fix the compiling bug on travis.

commit 4dd9dd3e1de626b51bfe85d9ee65f193d60e8d38
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Aug 21 11:52:37 2015 -0500

    Fixed minor alignment ambiguity bug in bli_pool.c.
    
    Details:
    - Fixed a typecasting ambiguity in bli_pool_alloc_block() in which
      pointer arithmetic was performed on a void* as if it were a byte
      pointer (such as char*). Some compilers may have already been
      interpreting this situation as intended, despite the sloppiness.
      Thanks to Aleksei Rechinskii for reporting this issue.
    - Redefined pointer alignment macros to typecast to uintptr_t instead of
      siz_t.
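
    Sketch: a minimal illustration of the two points above, assuming a
    power-of-two alignment. The helpers align_ptr_up() and offset_bytes()
    are hypothetical and are not the macros or functions in bli_pool.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: round a pointer up to the next multiple of align.
   The address is manipulated as a uintptr_t, the integer type intended to
   hold a converted pointer value. */
static void *align_ptr_up(void *p, size_t align)
{
    /* align must be a nonzero power of two for the mask trick to work. */
    assert(align != 0 && (align & (align - 1)) == 0);

    uintptr_t addr = (uintptr_t)p;
    uintptr_t mask = (uintptr_t)align - 1;

    return (void *)((addr + mask) & ~mask);
}

/* Hypothetical helper: add a byte offset to an untyped pointer. Standard C
   does not define arithmetic on void*, so the pointer is first converted
   to char*. */
static void *offset_bytes(void *p, size_t nbytes)
{
    return (char *)p + nbytes;
}
```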

commit 12ffd568b04feda57147c13b67717416a01c82f8
Author: Zhang Xianyi <traits.zhang@gmail.com>
Date:   Sat Aug 22 00:24:28 2015 +0800

    Add Travis CI.

commit ecc3ebb749e0861c27deda52b5f87236ede4901b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 29 13:31:12 2015 -0500

    CHANGELOG update (0.1.8)

commit 47caa33485b91ea6f2a5e386e61210c90c5f489f (tag: 0.1.8)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 29 13:31:09 2015 -0500

    Version file update (0.1.8)

commit ef0fbbbdb6148b96938733fce72cb4ed7dad685e
Merge: fdfe14f1 d4b89136
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 9 13:54:54 2015 -0500

    Merge branch 'master' of github.com:flame/blis

commit fdfe14f1e17ba5a2f8dfa0bdb799c6b0e730211b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 9 13:52:39 2015 -0500

    Added support for Intel Haswell/Broadwell.
    
    Details:
    - Added sgemm and dgemm micro-kernels, which employ 256-bit AVX vectors
      and FMA instructions. (Complex support is currently provided by the
      default induced method, 4m1a.)
    - Added a 'haswell' configuration, which uses the aforementioned kernels.
    - Inserted auto-detection support for haswell configuration in
      build/auto-detect/cpuid_x86.c.
    - Modified configure script to explicitly echo when automatic or manual
      configuration is in progress.
    - Changed beta scalar in test_gemm.c module of test suite from -1.0 to 0.9.

commit d4b891369c1eb0879ade662ff896a5b9a7fca207
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 7 10:06:53 2015 -0500

    Added 'carrizo' configuration.
    
    Details:
    - Added a new configuration for AMD Excavator-based hardware, also known
      as Carrizo when referring to the entire APU. This configuration uses
      the same micro-kernels as the piledriver configuration, but with
      different cache blocksizes.

commit 0b7255a642d56723f02d7ca1f8f21809967b8515
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jun 19 12:01:50 2015 -0500

    CHANGELOG update (0.1.7)

commit 267253de8a7be546ce87626443ee38701c1d411f (tag: 0.1.7)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jun 19 12:01:49 2015 -0500

    Version file update (0.1.7)

commit 7cd01b71b5e757a6774625b3c9f427f5e7664a76
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jun 19 11:31:53 2015 -0500

    Implemented dynamic allocation for packing buffers.
    
    Details:
    - Replaced the old memory allocator, which was based on statically-
      allocated arrays, with one based on a new internal pool_t type, which,
      combined with a new bli_pool_*() API, provides a new abstract data
      type that implements the same memory pool functionality but with blocks
      from the heap (i.e., malloc() or equivalent). Hiding the details of the
      pool in a separate API also allows for a much simpler bli_mem.c family
      of functions.
    - Added a new internal header, bli_config_macro_defs.h, which enables
      sane defaults for the values previously found in bli_config. Those
      values can be overridden by #defining them in bli_config.h the same
      way kernel defaults can be overridden in bli_kernel.h. This file most
      resembles what was previously a typical configuration's bli_config.h.
    - Added a new configuration macro, BLIS_POOL_ADDR_ALIGN_SIZE, which
      defaults to BLIS_PAGE_SIZE, to specify the alignment of individual
      blocks in the memory pool. Also added a corresponding query routine to
      the bli_info API.
    - Deprecated (once again) the micro-panel alignment feature. Upon further
      reflection, it seems that the goal of more predictable L1 cache
      replacement behavior is outweighed by the harm caused by non-contiguous
      micro-panels when k % kc != 0. I honestly don't think anyone will even
      miss this feature.
    - Changed bli_ukr_get_funcs() and bli_ukr_get_ref_funcs() to call
      bli_cntl_init() instead of bli_init().
    - Removed query functions from bli_info.c that are no longer applicable
      given the dynamic memory allocator.
    - Removed unnecessary definitions from configurations' bli_config.h files,
      which are now pleasantly sparse.
    - Fixed incorrect flop counts in addv, subv, scal2v, scal2m testsuite
      modules. Thanks to Devangi Parikh for pointing out these
      miscalculations.
    - Comment, whitespace changes.
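
    Sketch: a minimal illustration of a heap-backed block pool, assuming
    fixed-size blocks that are checked out and checked back in. The type
    toy_pool_t and the toy_pool_*() functions are hypothetical and are not
    the pool_t type or the bli_pool_*() API. Alignment and thread safety
    are omitted.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical pool: blocks[top .. num-1] are available for checkout. */
typedef struct {
    void  **blocks;
    size_t  num;
    size_t  top;
} toy_pool_t;

/* Allocate num blocks of block_size bytes from the heap. Returns 0 on
   success and -1 if any allocation fails. */
static int toy_pool_init(toy_pool_t *pool, size_t num, size_t block_size)
{
    pool->blocks = malloc(num * sizeof(void *));
    if (pool->blocks == NULL) return -1;

    pool->num = num;
    pool->top = 0;

    for (size_t i = 0; i < num; i++) {
        pool->blocks[i] = malloc(block_size);
        if (pool->blocks[i] == NULL) {
            /* Undo the allocations made so far. */
            while (i > 0) free(pool->blocks[--i]);
            free(pool->blocks);
            return -1;
        }
    }
    return 0;
}

/* Hand out the next available block, or NULL if the pool is exhausted. */
static void *toy_pool_checkout(toy_pool_t *pool)
{
    if (pool->top == pool->num) return NULL;
    void *block = pool->blocks[pool->top];
    pool->blocks[pool->top++] = NULL;
    return block;
}

/* Return a previously checked-out block to the pool. */
static void toy_pool_checkin(toy_pool_t *pool, void *block)
{
    assert(pool->top > 0);
    pool->blocks[--pool->top] = block;
}

/* Free every block; all blocks must have been checked in first. */
static void toy_pool_finalize(toy_pool_t *pool)
{
    assert(pool->top == 0);
    for (size_t i = 0; i < pool->num; i++) free(pool->blocks[i]);
    free(pool->blocks);
}
```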

commit 9848f255a3bab17d1139c391cca13ff3f1ffe6ed
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 11 19:14:22 2015 -0500

    Added early return to API-level _init() routines.
    
    Details:
    - Added conditional code that returns early from the API-level _init()
      routines if the API is already initialized. Actually meant for this to
      be included in 5f93cbe8.
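
    Sketch: a minimal illustration of an API-level _init() routine that
    returns early when the API is already initialized. All toy_mem_*()
    names are hypothetical, and the counter exists only to make the
    behavior observable.

```c
#include <stdbool.h>

/* Hypothetical per-API state. */
static bool toy_mem_initialized = false;
static int  toy_mem_setup_count = 0;   /* times real setup work has run */

static bool toy_mem_is_initialized(void)
{
    return toy_mem_initialized;
}

static void toy_mem_init(void)
{
    /* Early return: nothing to do if this API is already set up. */
    if (toy_mem_is_initialized()) return;

    toy_mem_setup_count++;             /* stand-in for real setup work */
    toy_mem_initialized = true;
}

static void toy_mem_finalize(void)
{
    /* Unset the flag so that a later _init() performs setup again. */
    toy_mem_initialized = false;
}
```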

commit 5f93cbe870f3478870e15581e7fd450dad5bba1e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 11 18:52:12 2015 -0500

    Introduced API-level initialization.
    
    Details:
    - Added API-level initialization state to _const, _error, _mem, _thread,
      _ind, and _cntl APIs. While this functionality will mostly go unused,
      adding minuscule overhead at init-time, there will be at least one
      instance in the near future where, in order to avoid an infinite loop,
      a certain portion of the initialization will call a query function that
      itself attempts to call bli_init(). API-level initialization will allow
      this later stage to verify that an earlier stage of initialization has
      completed, even if the overall call to bli_init() has not yet returned.
    - Added _is_initialized() functions for each API, setting the underlying
      bool_t during _init() and unsetting it during _finalize().
    - Comment, whitespace changes.

commit ee129c6b028bc5ac88da7c74fde72c49803742ff
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jun 10 12:53:28 2015 -0500

    Fixed bugs in _get_range(), _get_range_weighted().
    
    Details:
    - Fixed some bugs that only manifested in multithreaded instances of
      some (non-gemm) level-3 operations. The bugs were related to invalid
      allocation of "edge" cases to thread subpartitions. (Here, we define
      an "edge" case to be one where the dimension being partitioned for
      parallelism is not a whole multiple of whatever register blocksize
      is needed in that dimension.) In BLIS, we always require edge cases
      to be part of the bottom, right, or bottom-right subpartitions.
      (This is so that zero-padding only has to happen at the bottom, right,
      or bottom-right edges of micro-panels.) The previous implementations
      of bli_get_range() and _get_range_weighted() did not adhere to this
      implicit policy and thus produced bad ranges for some combinations of
      operation, parameter cases, problem sizes, and n-way parallelism.
    - As part of the above fix, the functions bli_get_range() and
      _get_range_weighted() have been renamed to use _l2r, _r2l, _t2b,
      and _b2t suffixes, similar to the partitioning functions. This is
      an easy way to make sure that the variants are calling the right
      version of each function. The function signatures have also been
      changed slightly.
    - Comment/whitespace updates.
    - Removed unnecessary '/' from macros in bli_obj_macro_defs.h.
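
    Sketch: a minimal illustration of the edge-case policy for a
    left-to-right partitioning, assuming whole blocks of size bf are dealt
    out first and the leftover edge is appended to the last (rightmost)
    range. The helper range_l2r() is hypothetical and does not reproduce
    the actual bli_get_range_l2r() signature or its exact distribution.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: give thread t (of nt) the range [start, end) of a
   dimension of size n. Every range boundary except the final one is a
   multiple of bf, so only the last range can contain the edge case. */
static void range_l2r(size_t n, size_t bf, size_t nt, size_t t,
                      size_t *start, size_t *end)
{
    assert(bf > 0 && nt > 0 && t < nt);

    size_t n_blocks   = n / bf;          /* whole blocks available       */
    size_t per_thread = n_blocks / nt;   /* whole blocks for each thread */
    size_t extra      = n_blocks % nt;   /* leftover whole blocks        */

    /* The first 'extra' threads each take one additional whole block. */
    size_t lo = t * per_thread + (t < extra ? t : extra);
    size_t hi = lo + per_thread + (t < extra ? 1 : 0);

    *start = lo * bf;
    *end   = hi * bf;

    /* The partial block, if any, always goes to the rightmost range. */
    if (t == nt - 1) *end = n;
}
```

    For n = 22, bf = 4, and two threads, the ranges are [0,12) and [12,22),
    so the two leftover elements sit at the right edge.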

commit 9135dfd69d39f3bbd75034f479f27a78dbfebcce
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jun 5 13:37:44 2015 -0500

    Minor updates to test/3m4m files.

commit d62ceece943b20537ec4dd99f25136b9ba2ae340
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jun 3 12:56:45 2015 -0500

    Minor update to test/3m4m/runme.sh.
    
    Details:
    - Removed some stale script code that should have been removed
      during 590bb3b8c.

commit b6ee82a3d421c9c4f1eb6848c7c6e37aa46de799
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jun 3 12:14:23 2015 -0500

    Minor cleanup to bli_init() and friends.
    
    Details:
    - Spun-off initialization of global scalar constants to bli_const_init()
      and of threading stuff to bli_thread_init().
    - Added some missing _finalize() functions, even when there is nothing
      to do.

commit 1213f5cebabc1637ce9dd45c4bfa87bb93677c29
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jun 2 13:27:47 2015 -0500

    POSIX thread bugfixes/edits to bli_init.c, _mem.c.
    
    Details:
    - Fixed a sort-of bug in bli_init.c whereby the wrong pthread mutex
      was used to lock access to initialization/finalization actions.
      But everything worked out okay as long as bli_init() was called by
      single-threaded code.
    - Changed to static initialization for memory allocator mutex in
      bli_mem.c, and moved mutex to that file (from bli_init.c).
    - Fixed some type mismatches in bli_threading_pthreads.c that resulted
      in compiler warnings.
    - Fixed a small memory leak with allocated-but-never-freed (and unused)
      pthread_attr_t objects.
    - Whitespace changes to bli_init.c and bli_mem.c.

commit 590bb3b8c5c0389159c5a9451b6c156c5f237e8a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun May 24 16:02:53 2015 -0500

    Backed-out adjusted dim changes to test/3m4m.
    
    Details:
    - Reverted most changes applied during commit ec25807b.

commit ec25807b26da943868f0d0517c3720e50181b8f9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 10 13:23:50 2015 -0500

    Tweaks to test/3m4m to test with adjusted dims.
    
    Details:
    - Updated test/3m4m driver files to build test drivers that allow
      comparison of real "asm_blis" results to complex "asm_blis" results,
      except with the latter's problem sizes adjusted so that problems are
      generated with equal flop counts.

commit 426b6488580a92bf071a62dc319a9c837ce39821
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Apr 8 15:12:21 2015 -0500

    Fixed a packing bug that manifested in trsm_r.
    
    Details:
    - Fixed a bug that caused a memory leak in the contiguous memory
      allocator. Because packm_init() was using simple aliasing when
      a subpartition object was marked as zeros by bli_acquire_mpart_*(),
      the "destination" pack object's mem_t entry was being overwritten
      by the corresponding field of the "source" object (which was likely
      NULL). This prevented the block from being released back to the
      memory allocator. But this bug only manifested when changing the
      location of packing B from outside the var1 loop to inside the
      var3 loop, and only for trsm with triangular B (side = right). The
      bug was fixed by changing the type of alias used in packm_init()
      when handling zero partition cases. Specifically, we now use
      bli_obj_alias_for_packing(), which does not clobber the destination
      (pack) object's mem_t field. Thanks to Devangi Parikh for this bug
      report.

commit c84286d5cef48f16d83831baac1f46b9856b9a36
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Apr 4 15:39:14 2015 -0500

    More minor tweaks to test/3m4m.
    
    Details:
    - Added a line of output that forces matlab to allocate the entire array
      up-front.
    - Re-enabled real domain benchmarks in runme.sh, which were temporarily
      disabled.

commit 309717c8ebf4ef1369f15cf41340e13c25b41573
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 3 19:28:49 2015 -0500

    More tweaks to test/3m4m, configurations.
    
    Details:
    - Fixed incorrect number of mc_x_kc memory blocks in
      sandybridge/bli_config.h.
    - Enabled OpenMP multithreading in piledriver/bli_config.h.
    - More updates to test/3m4m driver files.

commit 4baf3b9c69b2f648be9e46e07ccc9859dd675828
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 3 16:44:32 2015 -0500

    Tweaked test/3m4m driver, including acml support.
    
    Details:
    - Added ACML support to test/3m4m driver Makefile and runme.sh script.

commit a32f7c49ca4ea869d2a6c66818780f4321743d67
Merge: 349e075a 4bfd1ce8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 3 08:28:11 2015 -0500

    Merge pull request #23 from xianyi/master
    
    Add auto-detecting CPU on configure stage.

commit 349e075ad6a8e2a1211d94f36d24828c9d44b052
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 2 18:12:28 2015 -0500

    Tweaks to sandybridge config, test/3m4m driver.
    
    Details:
    - Enable OpenMP support by default in sandybridge's bli_config.h.
    - Reorganized sandybridge's bli_kernel.h.
    - Updated 3m4m Makefile, runme.sh to also test MKL implementation.

commit 4bfd1ce8ca93f93d170dd2715f0a32027b417b46
Author: Zhang Xianyi <traits.zhang@gmail.com>
Date:   Thu Apr 2 16:40:21 2015 -0500

    Detect NEON for cortex-a9 and cortex-a15.

commit aa6eec4f43137057276fe6119bdbfb5c52682527
Author: Zhang Xianyi <traits.zhang@gmail.com>
Date:   Thu Apr 2 16:03:44 2015 -0500

    Detect the CPU architecture. Support ARM cores.
    
    Detect the CPU architecture using the compiler's predefined macros.
    Then, detect the CPU cores.
    
    Support detecting x86 and ARM architectures.

commit 2947cfb749c937b0f62fac36cc92f123bd45b53c
Author: Zhang Xianyi <traits.zhang@gmail.com>
Date:   Wed Apr 1 12:24:00 2015 -0500

    Add auto-detecting CPU on configure stage.
    e.g.  /Path_to_BLIS/configure auto
    
    For now, it only supports detecting x86 CPUs.

commit 26a4b8f6f985597f80e0174990bf541f1d9bafac
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Apr 1 10:44:54 2015 -0500

    Implemented 3m2, 3m3 induced algorithms (gemm only).
    
    Details:
    - Defined a new "3ms" (separated 3m) pack schema and added appropriate
      support in packm_init(), packm_blk_var2().
    - Generalized packm_struc_cxk_3mi to take the imaginary stride (is_p)
      as an argument instead of computing it locally. Exception: for trmm,
      is_p must be computed locally, since it changes for triangular
      packed matrices. Also exposed is_p in interface to dt-specific
      packm_blk_var2 (and _var1, even though it does not use imaginary
      stride).
    - Renamed many functions/variables from _3mi to _3mis to indicate that
      they work for either interleaved or separated 3m pack schemas.
    - Generalized gemm and herk macro-kernels to pass in imaginary stride
      rather than compute them locally.
    - Added support for 3m2 and 3m3 algorithms to frame/ind, including 3m2-
      and 3m3-specific virtual micro-kernels.
    - Added special gemm macro-kernels to support 3m2 and 3m3.
    - Added support for 3m2 and 3m3 to testsuite.
    - Corrected the type of the panel dimension (pd_) in various macro-
      kernels from inc_t to dim_t.
    - Renamed many functions defined in bli_blocksize.c.
    - Moved most induced-related macro defs from frame/include to
      frame/ind/include.
    - Updated the _ukernel.c files so that the micro-kernel function pointers
      are obtained from the func_t objects rather than the cpp macros that
      define the function names.
    - Updated test/3m4m driver, Makefile, and run script.

commit ddf62ba7d2da08225b201585b85e06c967767dea
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Mar 27 14:27:51 2015 -0500

    Refuse to free the packm thread info if it uses the single threaded version

commit 016fc587584d958a0e430a56a5e2c05022ac2f17
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Mar 27 14:23:02 2015 -0500

    Don't free packm thread info if it is null

commit 00a443c529a60862a57b93e303a0b3212c9b1df4
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Mar 27 14:11:07 2015 -0500

    Use bli_malloc instead of malloc for the thread info paths

commit f1a6b7d02861ccebdc500ea98778cc0f6cddad17
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Mar 18 15:37:10 2015 -0500

    Reorganized code for induced complex methods.
    
    Details:
    - Consolidated most of the code relating to induced complex methods
      (e.g. 4mh, 4m1, 3mh, 3m1, etc.) into frame/ind. Induced methods
      are now enabled on a per-operation basis. The current "available"
      (enabled and implemented) implementation can then be queried on
      an operation basis. Micro-kernel func_t objects as well as blksz_t
      objects can also be queried in a similar manner.
    - Redefined several micro-kernel and operation-related functions in
      bli_info_*() API, in accordance with above changes.
    - Added mr and nr fields to blksz_t object, which point to the mr
      and nr blksz_t objects for each cache blocksize (and are NULL for
      register blocksizes). Renamed the sub-blocksize field "sub" to
      "mult" since it is really expressing a blocksize multiple.
    - Updated bli_*_determine_kc_[fb]() for gemm/hemm/symm, trmm, and
      trsm to correctly query mr and nr (for purposes of nudging kc).
    - Introduced an enumerated opid_t in bli_type_defs.h that uniquely
      identifies an operation. For now, only level-3 id values are defined,
      along with a generic, catch-all BLIS_NOID value.
    - Reworked testsuite so that all induced methods that are enabled
      are tested (one at a time) rather than only testing the first
      available method.
    - Reformatted summary at the beginning of testsuite output so that
      blocksize and micro-kernel info is shown for each induced method
      that was requested (as well as native execution).
    - Reduced the number of columns needed to display non-matlab
      testsuite output (from approx. 90 to 80).

commit 8d5169ccda954e5f72944308a036dcb7ebfc9097
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Mar 18 11:38:08 2015 -0500

    Fixed bug in release of mem_t buffer.
    
    Details:
    - Fixed a bug that affects all level-2 and level-3 blocked variants. The
      bug only manifested, however, if the packing of operands (A and B in
      gemm, for example) spanned multiple nodes in the control tree. Until
      recently, the main consumers of packm were level-3 operations, all of
      which packed both input operands from blocked variant 1 (B outside of
      the loop, and A within the loop). This particular usage masked a flaw
      in the code whereby bli_obj_release_pack() would always release the
      underlying mem_t buffer (provided it was allocated), even if the buffer
      was not allocated in the current variant. This has been fixed by
      replacing all calls to bli_obj_release_pack() with calls to a new
      function, bli_packm_release(), which takes the same control tree node
      argument passed into the object's corresponding call to packm_init()
      or packv_init(). bli_packm_release() then proceeds to invoke
      bli_obj_release_pack() only if the control tree node indicates that
      packing was requested. Thanks to Devangi Parikh for identifying this
      bug.

commit c0acca0f5182ba96fd39c9d10b34a896a6e74206
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 3 10:56:22 2015 -0600

    Clarified comments in testsuite input.operations.

commit 03ba9a6b17861d9e1adc0cf924439c4d7e860d19
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 24 10:33:28 2015 -0600

    Removed some 'old' directories.

commit a86db60ee270cdeb745ae7cf68f9e0becc9f522d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 23 18:42:39 2015 -0600

    Extensive renaming of 3m/4m-related files, symbols.
    
    Details:
    - Renamed all remaining 3m/4m packing files and symbols to 3mi/4mi
      ('i' for "interleaved"). Similar changes to 3M/4M macros.
    - Renamed all 3m/4m files and functions to 3m1/4m1.
    - Whitespace changes.

commit 8cf8da291a0fb2f491f410969a76ec0fbda47faf
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 20 15:24:27 2015 -0600

    Minor updates to induced complex mode management.
    
    Details:
    - Relocated bli_4mh.c, bli_4mb.c, bli_4m.c, bli_3mh.c, bli_3m.c (and
      associated headers) from frame/base to frame/base/induced.
    - Added bli_xm.? to frame/base/induced, which implements
      bli_xm_is_enabled(), which detects whether ANY induced complex method
      is currently enabled.
    - The new function bli_xm_is_enabled() is now used in bli_info.c to
      detect when an induced complex method is used, so we know when to
      return blocksizes from one of the induced methods' blocksize objects.

commit 411e637ee7d1083a84f58f08938d51e63d7c3c9a
Merge: c2569b88 fc0b7712
Author: Tyler Michael Smith <tms@cs.utexas.edu>
Date:   Fri Feb 20 20:39:25 2015 -0600

    Merge branch 'master' of http://github.com/flame/blis

commit c2569b8803d4ccc1d7b6f391713461b51443601d
Author: Tyler Michael Smith <tms@cs.utexas.edu>
Date:   Fri Feb 20 20:38:19 2015 -0600

    Fixed a memory leak in freeing the thread infos

commit fc0b771227abf86d81f505b324f69f6e83db1d8f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 20 11:47:44 2015 -0600

    Added max(mr,nr) to kc in static mem pools.
    
    Details:
    - Changed the static memory definitions to compute the maximum register
      blocksize for each datatype and add it to kc when computing the size
      of blocks of A and B. This formally accounts for the nudging of kc
      up to a multiple of mr or nr at runtime for triangular operations
      (e.g. trmm).
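
    Sketch: a minimal illustration of why adding max(mr,nr) to kc is
    sufficient, assuming kc is nudged up to the next multiple of mr or nr.
    The helpers round_up_to_mult(), max_size(), and block_a_elems() are
    hypothetical, as are the blocksize values in the note below.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round val up to the next multiple of mult. */
static size_t round_up_to_mult(size_t val, size_t mult)
{
    assert(mult > 0);
    return ((val + mult - 1) / mult) * mult;
}

static size_t max_size(size_t a, size_t b)
{
    return a > b ? a : b;
}

/* Hypothetical helper: number of elements to reserve for an mc-by-kc
   block of A when kc may later be nudged up. Rounding kc up to a multiple
   of mr (or nr) adds at most mr - 1 (or nr - 1), which is always less
   than max(mr, nr). */
static size_t block_a_elems(size_t mc, size_t kc, size_t mr, size_t nr)
{
    return mc * (kc + max_size(mr, nr));
}
```

    With kc = 256, mr = 6, and nr = 8, nudging kc up to a multiple of mr
    gives 258, which fits within the reserved 256 + 8 = 264.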

commit af32e3a608631953ef770341df10a14a991bf290
Author: Tyler Michael Smith <tms@cs.utexas.edu>
Date:   Thu Feb 19 22:51:11 2015 -0600

    Fixed a bug where get_range_weighted would return end = 0 for small problem sizes

commit 441d47542a64e131578d00da7404c1ed387a721c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 19 17:06:10 2015 -0600

    Renamed 3m and 4m symbols/macros to 3mi and 4mi.
    
    Details:
    - Renamed several variables and macros from 3m/4m to 3mi/4mi. This is
      because those packing schemas were always implicitly "interleaved".
      This new naming scheme will make way for new schemas that separate
      instead of interleave the real and imaginary (and summed) parts.
    - Expanded the pack format sub-field of the pack schema field of the
      info_t to 4 bits (from 3). This will allow for more schema types
      going forward.
    - Removed old _cntl.c files for herk3m, herk4m, trmm3m, trmm4m.

commit 518a1756ccf02122b96fc437b538604a597df42a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 19 14:27:09 2015 -0600

    Fixed indexing bug for trmm3 via 3mh, 4mh.
    
    Details:
    - Fixed a bug that only affected trmm3 when performed via 3mh or 4mh,
      whereby micro-panels of the triangular matrix were packed with "dead
      space" between them due to failing to adjust for the fact that pointer
      arithmetic was occurring in units of complex elements while the data
      being packed consisted of real elements. It turns out that the macro-
      kernel suffered from the same bug, meaning the panels were actually
      being packed and read consistently. The only way I was able to
      discover the bug in the first place was because the packed block of A
      was overflowing into the beginning of the packed row panel of B using
      the sandybridge configuration.

commit 493087d730f01d5169434f461644e5633f48a42f
Merge: 650d2a6f 25021299
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 18 09:45:51 2015 -0600

    Merge branch 'master' of github.com:flame/blis

commit 25021299b670775df8ca9c87910c63d7e74ed946
Merge: fe2b8d39 f05a5763
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 11 20:03:21 2015 -0600

    Merge branch 'master' of github.com:flame/blis

commit fe2b8d39a445ac848686e78c7540fd046cb95492
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 11 19:33:10 2015 -0600

    Fixed an obscure bug in 3mh/3m/4mh/4m packing.
    
    Details:
    - Modified bli_packm_blk_var1.c and _var2.c to increase the triangular
      case's panel increment by 1 if it would otherwise be odd. This is
      particularly necessary in _var2.c when handling the interleaved 3m
      or ro/io/rpi pack schemas, since division of an odd number by 2 can
      happen if both the panel length and the panel packing dimension
      (register packing blocksize) are odd, thus making their product odd.
    - Modified bli_packm_init.c so that panel strides are increased by 1
      if they would otherwise be odd, even for non-3m related packing.
    - Modified the trmm and trsm macro-kernels so that triangular packed
      micro-panels are traversed with this new "increment by 1 if odd"
      policy.
    - Added sanity checks in trmm and trsm macro-kernels that would result
      in an abort() if the conditions that would lead to a "divide odd
      integer by 2" scenario ever manifest.
    - Defined bli_is_odd(), _is_even() macros in bli_scalar_macro_defs.h.
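
    Sketch: a minimal illustration of the "increment by 1 if odd" policy,
    assuming the panel increment is the product of the panel length and the
    panel packing dimension. The helpers is_odd(), make_even(), and
    tri_panel_inc() are hypothetical and are not the bli_is_odd() macro or
    the packm code itself.

```c
#include <assert.h>
#include <stddef.h>

static int is_odd(size_t a)
{
    return (a % 2) != 0;
}

/* Bump a value up by one if it would otherwise be odd. */
static size_t make_even(size_t a)
{
    return is_odd(a) ? a + 1 : a;
}

/* Hypothetical helper: panel increment for a triangular micro-panel. The
   product is odd only when both factors are odd; making it even keeps a
   later division by 2 exact. */
static size_t tri_panel_inc(size_t panel_len, size_t pack_dim)
{
    size_t inc = make_even(panel_len * pack_dim);
    assert(!is_odd(inc));
    return inc;
}
```

    A panel length of 3 and a packing dimension of 5 give a product of 15,
    which is bumped to 16. Even products such as 4 * 5 = 20 are unchanged.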

commit 650d2a6ff2e593151a296ca86b5214afcc747afc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 9 14:59:20 2015 -0600

    Added initial support for imaginary stride.
    
    Details:
    - Added an imaginary stride field ("is") to obj_t.
    - Renamed bli_obj_set_incs() macro to bli_obj_set_strides().
    - Defined bli_obj_imag_stride() and bli_obj_set_imag_stride() and
      added invocations in key locations.
    - Added some basic error-checking related to imaginary stride.
    - For now, imaginary stride will not be exposed into the most-used
      BLIS APIs such as bli_obj_create(), and certainly not the
      computational APIs such as bli_dgemm().

commit f05a57634a7c8e3864b25b3335d1194c1ea1aeb9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Feb 8 19:40:34 2015 -0600

    Defined gemm cntl function to query ukrs func_t.
    
    Details:
    - Added a new function, bli_gemm_cntl_ukrs(), that returns the func_t*
      for the gemm micro-kernels from the leaf node of the control tree.
      This allows all the func_t* fields from higher-level nodes in the tree
      to be NULL, which makes the function that builds the control trees
      slightly easier to read.
    - Call bli_gemm_cntl_ukrs() instead of the cntl_gemm_ukrs() macro in
      all bli_*_front() functions (which is needed to apply the row/column
      preference optimization).
    - In all level-3 bli_*_cntl_init() functions, changed the _obj_create()
      function arguments corresponding to the gemm_ukrs fields in higher-
      level cntl tree nodes to NULL.
    - Removed some old her2k macro-kernels.

commit cefd3d5d2001264de17cf63dae541f890cb9daaf
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Feb 5 11:09:12 2015 -0600

    A couple of functions were incorrectly ifdeffed away on Xeon Phi. Fixed this

commit 7574c9947d57a19f613880e3b9f62f8c8f6df4ec
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 4 12:11:55 2015 -0600

    Added basic flop-counting mechanism (level-3 only).
    
    Details:
    - Added optional flop counting to all level-3 front-ends, which is
      enabled via BLIS_ENABLE_FLOP_COUNT. The flop count can be
      reset at any time via bli_flop_count_reset() and queried via
      bli_flop_count(). Caveats:
      - flop counts are approximate for her[2]k, syr[2]k, trmm, and
        trsm operations;
      - flop counts ignore extra flops due to non-unit alpha;
      - flop counts do not account for situations where beta is zero.
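    
    A rough sketch of how such a counter could work, assuming the
    nominal 2*m*n*k cost for real gemm. The function names below are
    hypothetical and are not the BLIS implementation:

```c
#include <assert.h>

/* Running total; a plain global, so not thread-safe in this sketch. */
static double flop_total = 0.0;

static void flop_count_reset(void) { flop_total = 0.0; }

static double flop_count(void) { return flop_total; }

/* Nominal real gemm cost: m*n*k multiplies plus m*n*k adds. As in the
   caveats above, scaling by alpha and the beta == 0 case are ignored. */
static void flop_count_add_gemm(long m, long n, long k)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    flop_total += 2.0 * (double)m * (double)n * (double)k;
}
```

    A caller would reset the counter, run some level-3 operations, and
    then read the total.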

commit ceda4f27d1f1bcf19320e09848e0f2e3b9941e6c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jan 29 13:22:54 2015 -0600

    Implemented bli_obj_imag_equals().
    
    Details:
    - Implemented a new function, bli_obj_imag_equals(), which compares the
      imaginary part of the first argument to the second argument, which may
      be a BLIS_CONSTANT or of a regular real datatype.

commit 81114824a05a9053229efd577a8a94a856deda93
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jan 6 12:15:21 2015 -0600

    Minor 4m/3m consolidation to mem_pool_macro_defs.h.
    
    Details:
    - Merged the 4m and 3m definitions in bli_mem_pool_macro_defs.h to
      reduce code and improve readability.

commit 36a9b7b7436d9423ba4de2a9f85cfcd43577b783
Author: Tyler Michael Smith <tms@cs.utexas.edu>
Date:   Wed Dec 17 21:53:50 2014 +0000

    reduced the default number of MC by KC blocks for bgq

commit c60619c7c3568f044a849abbab60209aa7455423
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 16 17:08:22 2014 -0600

    Minor tweaks for 3m4m test drivers.
    
    Details:
    - Changed gemm_kc blocksizes to be reduced by two-thirds instead of
      half.
    - Changed 3m4m/test_gemm.c driver to divide by 3 instead of 2 when
      computing the fixed k dimension.
    - Fixed runme.sh so that it would use multiple threads for s/dgemm
      cases.

commit c6929ba6a5e6f633a7295e979a2b8df8c7ecdb1b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 16 11:27:50 2014 -0600

    Added 4m_1b to test/3m4m test driver and script.

commit 785d480805fc0d6f4251b5499933515740b6b2a7
Merge: 9456f330 4156c088
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Dec 12 14:34:19 2014 -0600

    Merge branch 'master' of github.com:flame/blis

commit 9456f330af4617f9ee32972d51f974aa2d84f97b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Dec 12 14:31:57 2014 -0600

    Added 4m_1b implementation for gemm.
    
    Details:
    - Added yet another 4m-based implementation for complex domain level-3
      operations. This method, which the 3m/4m paper identifies as Algorithm
      "4m_1b", fissures the first loop around the micro-kernel so that the
      real sub-panel of the current micro-panel of B is multiplied against
      (both sub-panels of) all micro-panels of A, before doing the same for
      the imaginary sub-panel of the micro-panel of B. For now, only gemm is
      supported, and 4m_1b (labeled "4mb" within the framework) is not yet
      integrated into the test suite.

commit 4156c0880d9aea4ff04a9c4fa139ba8c437d8bfb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 9 16:03:14 2014 -0600

    Fixed obscure level-2 packing / general stride bug.
    
    Details:
    - Fixed a bug in certain structured level-2 operations that manifested
      only when the structured matrix was provided to BLIS as a matrix stored
      with general stride. The bug was introduced in c472993b when the
      densify field was removed from the packm control tree node and
      associated APIs. Since then, the packed object was unconditionally
      marked with an uplo field of BLIS_DENSE. This is fine for level-3
      operations where micro-panels are always densified, but in level-2
      contexts, the underlying unblocked variant (fused or unfused) of
      structured operations (e.g. trmv) still needs to know whether to
      execute its "lower" or "upper" branches of code. Since this field
      was unconditionally being set to BLIS_DENSE, the unblocked variants
      were always executing the "else" branch, which happened to be the
      "lower" case code. Thus, running an upper case produced the wrong
      answer. This most obviously manifested in the form of failures for
      trmm, trmm3, and trsm in the test suite.
      The bug was fixed by setting the packed object's uplo field to
      BLIS_DENSE only if the schema indicated that micro-panels were to be
      packed. Otherwise, we can assume we are packing to regular row or
      column storage, as is the case with level-2 packing. Thanks to
      Francisco Igual for reporting the testsuite failures and ultimately
      leading us to this bug.

commit 689f60a578b461119e9ea90c74f642b9eb79addb
Merge: bef24e67 483e4d6a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Dec 7 14:03:30 2014 -0600

    Merge pull request #21 from figual/master
    
    Adding armv8a configuration and micro-kernels.

commit 483e4d6a3fdbef9d9ab47fb674c9476c70ca9f0f
Author: Francisco D. Igual <figual@ucm.es>
Date:   Sun Dec 7 20:27:49 2014 +0100

    Adding armv8a configuration and micro-kernels.
    
    Only sgemm micro-kernel is fully functional at this point.

commit bef24e67e0f93579c2a80315348dc2e227f72a72
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Nov 26 18:00:56 2014 -0600

    Fixed a type of race condition exposed by pthreads implementation.
    The lead thread of the inner thread communicator could exit its
    subproblem, move on to the next iteration of the loop, and modify
    a1_pack, b1_pack, or c1_pack while other threads were still using them.
    
    Barriers were inserted to fix this.

commit 76bde44411f0e34266bab9d666a54ef22be97320
Merge: e56e6143 f3d729e5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 26 17:25:24 2014 -0600

    Merge branch 'master' of github.com:flame/blis

commit f3d729e504ec012e7dc7e02b2ecd42e004c6894d
Author: Tyler Michael Smith <tms@cs.utexas.edu>
Date:   Wed Nov 26 22:25:24 2014 -0600

    Added static mutex to bli_init and bli_finalize

commit d71cc797866ff502ad1127527016f463267eef80
Author: Tyler Michael Smith <tms@cs.utexas.edu>
Date:   Wed Nov 26 21:35:39 2014 -0600

    Refactored bli_threading files and added support for pthreads

commit e56e61438ff7fcf25a48c0b7603f18df782b50b6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 26 17:20:35 2014 -0600

    Minor cleanups to bli_threading.h and friends.
    
    Details:
    - No longer need to define BLIS_ENABLE_MULTITHREADING manually in
      bli_config.h; it now gets defined when BLIS_ENABLE_OPENMP or
      BLIS_ENABLE_PTHREADS is defined.
    - Added sanity check to prevent both BLIS_ENABLE_OPENMP and
      BLIS_ENABLE_PTHREADS from being enabled simultaneously.
    - Reorganization of bli_threading*.h header files, which led to
      simplification of threading-related part of blis.h.
    - Added "-fopenmp -lpthread" to LDFLAGS of sandybridge make_defs.mk
      file.

commit 3be2744cbe2c56d38c23fd818aa5c1f10cc7ea51
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Nov 21 12:28:08 2014 -0600

    Update to template gemm ukernel comments.
    
    Details:
    - Updated comments on alignment of a1 and b1 to match wiki.

commit 994429c6881b2ade92d9d7949bcaebfbf2cc65eb
Merge: 58796abd 694029d9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Nov 20 13:55:35 2014 -0600

    Merge pull request #20 from TimmyLiu/master
    
    #define PASTEF773 required by cblas compatibility layer

commit 694029d9d7db857d642ab536955c0621791108c8
Author: Timmy <timmy.liu@amd.com>
Date:   Wed Nov 19 15:25:14 2014 -0600

    #define PASTEF773 required by cblas compatibility layer

commit 58796abda66b133346f8d523b39178afc336351f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Nov 6 14:31:52 2014 -0600

    Removed KC constraint comments from _kernel.h files.
    
    Details:
    - Since 4674ca8c, the constraint that KC be a multiple of both MR and
      NR has been relaxed, and thus it was time to remove the comments
      from the top of the bli_kernel.h files of all configurations.

commit 7bbc95a54f706d43c7f7951f0e5995f86130cd52
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 29 10:52:23 2014 -0500

    Added new piledriver micro-kernels.
    
    Details:
    - Added new micro-kernels for the AMD piledriver architecture (one
      for each datatype).
    - Updates and tweaks to piledriver configuration.
    - Added 3xk packm micro-kernel support.
    - Explicitly unrolled some of the smaller packm micro-kernels.
    - Added notes to avx/sandybridge and piledriver micro-kernel files
      acknowledging the influence of the corresponding kernel code in
      OpenBLAS.

commit 59613f1d5500f6279963327db2fbc84bc9135183
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 23 17:21:37 2014 -0500

    Added separate micro-panel alignment for A and B.
    
    Details:
    - Changed the recently-added micro-panel alignment macros so that we now
      have two sets--one for micro-panels of matrix A and one for micro-
      panels of matrix B: BLIS_UPANEL_[AB]_ALIGN_SIZE_?.
    - Store each set of alignment values into a separate blksz_t object in
      bli_gemm_cntl_init().
    - Adjusted packm_init() to use the separate alignment values.
    - Added query routines for the new alignment values to bli_info.c.
    - Modified test suite output accordingly.

commit a8e12884ee1fddd3fd77ca5a68aa0cb857f3af57
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 23 11:35:48 2014 -0500

    CHANGELOG update (0.1.6)

commit 38ea5022e4ed846112198c4e1672fcdaeb90dc71 (tag: 0.1.6)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 23 11:35:45 2014 -0500

    Version file update (0.1.6)

commit a3e6341bdb0e28411f935d6b4708a6389663e004
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 23 11:13:28 2014 -0500

    Factored common code from blocksize functions.
    
    Details:
    - Split bli_determine_blocksize_[fb]() into two functions each, the
      newer ones ending with the _sub suffix. These new sub-functions are
      now called from bli_[gemm|trmm|trsm]_determine_kc_[fb](), which
      eliminates redundant code and will allow any future tweaks to the
      core sub-functions to automatically be inherited by the operation-
      specific versions.

commit 4674ca8cffb58331ff7edf23bbe0e3f6a7558489
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 23 10:50:59 2014 -0500

    Extended newly relaxed KC to hemm, symm.
    
    Details:
    - These changes were intended for the previous commit.
    - Defined bli_gemm_determine_kc_[fb](),
      which determine blocksizes for gemm-based operations, taking special
      care to "nudge" the kc dimension up to a multiple of MR or NR for
      hemm and symm operations, as needed.
    - Changed bli_gemm_blk_var3f.c to call bli_gemm_determine_kc_f()
      instead of bli_determine_blocksize_f().
    - Comment updates to bli_trmm_blocksize.c, bli_trsm_blocksize.c.

commit ab954ba6f874eaca7b001804491f866ef6b9b327
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 22 17:21:58 2014 -0500

    Relaxed constraint that KC be multiple of MR, NR.
    
    Details:
    - Relaxed a long-held requirement in register blocksizes that required
      the kernel programmer to choose a KC that was divisible by both MR
      and NR. This was very constraining on some architectures that did not
      use register blocksizes that were powers of two. The constraint is
      now enforced only for trmm and trsm, where it is needed, and it is
      now handled by "nudging" kc upward at runtime, if necessary, to be a
      multiple of MR or NR, as needed.
    - Defined bli_trmm_determine_kc_[fb]() and bli_trsm_determine_kc_[fb](),
      which determine blocksizes for trmm and trsm, taking special care to
      "nudge" the kc dimension up to a multiple of MR or NR, as needed.
    - Changed bli_trmm_blk_var3[fb].c to call bli_trmm_determine_kc_[fb]()
      instead of bli_determine_blocksize_[fb]().
    - Added safeguard to bli_align_dim_to_mult() that returns the dimension
      unmodified if the dimension multiple is zero (to avoid division by
      zero).
    - Removed cpp guard/check for KC % MR == 0 and KC % NR == 0 from
      bli_kernel_macro_defs.h.
    - Whitespace, variable name changes to bli_blocksize.c.
    - Removed old commented code from bli_gemm_cntl.c.
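    
    The "nudging" of kc and the zero-multiple safeguard can be sketched
    as follows. This is an illustrative helper with a hypothetical name
    and signature, not the actual bli_align_dim_to_mult() source:

```c
#include <assert.h>

/* Round dim up to the nearest multiple of mult. If mult is zero, return
   dim unmodified so that no division by zero can occur. */
static long align_dim_to_mult(long dim, long mult)
{
    assert(dim >= 0 && mult >= 0);
    if (mult == 0) return dim;
    return ((dim + mult - 1) / mult) * mult;
}
```

    For example, a kc of 256 with a register blocksize of 6 is nudged up
    to 258, while a kc of 256 with a register blocksize of 8 is left
    unchanged.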

commit 95cdae65d6b88e043ee14bcd53cd2e800d7aecb4
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Oct 22 16:30:16 2014 -0500

    Fixed bug in KNC microkernel where k=0 and beta != 1

commit e64dba5633fc49b768b5edc7762f2b5d8a4d0588
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 20 19:23:06 2014 -0500

    Re-implemented micro-panel alignment.
    
    Details:
    - This commit re-implements a feature that was removed in commit
      c2b2ab62. It was removed because, at the time, I wasn't sure how the
      micro-panel alignment feature would interact with the 4m method (when
      applied at the micro-kernel level), and so it seemed safer to disable
      the feature entirely rather than allow possible breakage. This commit
      revisits the issue and safely re-implements the feature in a way that
      is compatible with 4m, 3m, 4mh, and 3mh (and native execution).
    - Modified the static memory pool to account for micro-panel alignment
      space.
    - Modified packm_init and blocked variants to align whole micro-panels
      by a datatype-specific alignment value that may be set by the
      configuration. (If it is not set by the configuration, it will default
      to BLIS_SIZEOF_?.)
    - Modified macro-kernels so that:
      - storage stride is handled properly given the new micro-panel
        alignment behavior;
      - indexing through 3m/4m/rih-type sub-panels, as is done by trmm and
        trsm, is more robust (e.g. will work if the applicable packing
        register blocksize is odd);
      - imaginary strides are computed and stored within auxinfo_t structs,
        which allows the virtual micro-kernels to more easily determine how
        to index into the micro-panel operands.
    - Modified virtual 3m and 4m micro-kernels to use the imaginary strides
      within the auxinfo_t structs instead of panel strides.
    - Deprecated the panel stride fields from the auxinfo_t structs.
    - Updated test suite to print out the micro-panel alignment values.

commit add16b0e5402924301e7078e4ca5e3ef725bff0b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 17 11:49:24 2014 -0500

    Added 3m4m test driver subdir of 'test'.
    
    Details:
    - Added a modified test driver for [cz]gemm that will test all 3m/4m
      as well as assembly-based and OpenBLAS implementations of gemm
      in single and multithreaded modes.

commit e171504a72406c61a173241d8bccf0a5ceb10582
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 17 11:25:59 2014 -0500

    Use correct definition of bli_is_last_iter().
    
    Details:
    - As intended for previous commit, the new definition of
      bli_is_last_iter() is now disabled in favor of the old
      definition.

commit 0d954087b2b55d2f5f3c5e57d702b318ca2300f6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 17 11:19:34 2014 -0500

    Minor changes and fixes.
    
    Details:
    - Redefined bli_is_last_iter() to take thread_id and num_thread
      arguments, which allows the macro to correctly compute whether a
      given iteration is the last that the thread will compute in that
      particular loop. The new definition, however, remains disabled
      (commented out) until someone can look at this more closely, as
      the new definition seems to actually hurt performance slightly.
    - Whitespace and related updates to level-3 macro-kernels.
    - Updated test suite so that performance results in the hundreds of
      gigaflops do not disrupt the column alignment of the output.

commit d1e86e1876e433f54b501ec5a005b4ba7c5ce4e6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Oct 12 13:43:47 2014 -0500

    More minor tweaks to sandybridge/avx micro-kernel.
    
    Details:
    - Re-enabled use of b_next for dgemm and cgemm micro-kernels.

commit 7b6fe4cae57cb22c09c1a97595e1a201a02cbcd2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Oct 12 12:01:51 2014 -0500

    Minor tweaks to sandybridge/avx micro-kernels.
    
    Details:
    - Changed the MC blocksize for zgemm micro-kernel from 128 to 64.
    - Removed usage of b_next in all x86_64/avx gemm micro-kernels.

commit a6a156e9feec47154e7a0fd43bcc006b1fc04aba
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 10 14:26:41 2014 -0500

    Added cgemm ukernel for avx/sandybridge.
    
    Details:
    - Implemented AVX-based cgemm micro-kernel (via GNU extended inline
      assembly syntax).
    - Updated sandybridge configuration accordingly.

commit 6f8575ab2580e167a022293b76ddf0514f71b613
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 10 10:01:45 2014 -0500

    Added zgemm ukernel for avx/sandybridge.
    
    Details:
    - Implemented AVX-based zgemm micro-kernel (via GNU extended inline
      assembly syntax).
    - Updated sandybridge configuration accordingly.

commit 23ce7ee542a12ca40b4b6090ad2558d180e16d37
Merge: 99fd9a39 7a8ad47f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 9 16:41:22 2014 -0500

    Merge branch 'master' of github.com:flame/blis

commit 99fd9a39718cb7281f6fb23f9fef7cca4fe514f4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 9 16:38:04 2014 -0500

    Fixed two minor bugs.
    
    Details:
    - Fixed a bug in the test suite for the trsm_ukr and gemmtrsm_ukr test
      modules whereby the uplo bits of some packed matrix objects were not
      being set properly, resulting in false FAILURE results for those
      tests. Thanks to Tyler Smith for bringing this issue to my attention.
    - Fixed a bug in bli_obj_alloc_buffer() that caused an unnecessary
      "not yet implemented" abort() when creating a 1x1 object with non-unit
      strides.

commit 7a8ad47fb2d100a9da93aa8cab774fcceeaab733
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Oct 8 15:52:13 2014 -0500

    Minor changes to knc configuration, including a preference for row-major storage
    Also fixed a bug in the knc micro-kernel where it would fail if k == 0

commit 76b7c34af0c09f47d9615b18857a356acddc788a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 2 14:15:38 2014 -0500

    Fixed a bug in the pack schema-related bit macros.
    
    Details:
    - Expanded the BLIS_PACK_SCHEMA_BITS value in bli_type_defs.h to
      include all six bits presently used in the pack schema bitfield of
      the info field of obj_t structs. Prior to this commit, the macro
      constant only included the lowest five bits, which excluded the
      "is or is not packed" bit. This manifested as a strange bug in
      probably many level-2 codes that invoked packing, though we only
      observed it in ger before fixing. Thanks to Devin Matthews for
      finding and reporting this bug.
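    
    A sketch of why a five-bit mask misbehaves on a six-bit field. The
    bit positions and names below are assumptions chosen for
    illustration; they are not the real values in bli_type_defs.h:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: a six-bit pack schema field starting at bit 16,
   whose top bit is the "is or is not packed" flag. */
#define SCHEMA_SHIFT      16
#define SCHEMA_MASK_5BITS ((uint32_t)0x1F << SCHEMA_SHIFT) /* omits packed bit */
#define SCHEMA_MASK_6BITS ((uint32_t)0x3F << SCHEMA_SHIFT) /* whole field */
#define PACKED_BIT        ((uint32_t)0x20 << SCHEMA_SHIFT)

/* Replace the bits selected by mask with schema; other bits are kept. */
static uint32_t set_schema(uint32_t info, uint32_t mask, uint32_t schema)
{
    assert((schema & ~mask) == 0);
    return (info & ~mask) | schema;
}
```

    Clearing the schema of a packed object through the five-bit mask
    leaves the packed bit set, whereas the six-bit mask clears the whole
    field.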

commit a5763e332226598d70c47dfa9cad4578e15ef5f4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 2 13:28:17 2014 -0500

    Added extra output to bli_obj_print().
    
    Details:
    - Print extra values from info field of obj_t struct within
      bli_obj_print().

commit 9bba209fc44fbfce943ba6a51cd8278a0cb6b159
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Mon Sep 29 14:56:36 2014 -0500

    Fixed bug when packing anywhere besides in blk_var_1 for gemm.

commit 614a4afc9272adb47e5a8b83b39d56c2804d95d6
Merge: b541b667 4a7df04e
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Sep 26 10:49:57 2014 -0500

    Merge branch 'master' of http://github.com/flame/blis

commit 4a7df04e8a4ffdb9561d26426afd35e4fe15b013
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 22 16:06:15 2014 -0500

    Added 30xk support for packm ukernels.
    
    Details:
    - Updated bli_kernel_*_macro_defs.h headers to include default
      definitions for 30xk packm kernels.
    - Extended function pointer arrays in bli_packm_cxk_*() out to 31 and
      included 30xk kernels.
    - Added 30xk kernels to frame/1m/packm/ukernels/bli_packm_ref_cxk_*.c.

commit b6d4bd792e0d44ce4b28afef343f5ff3ba89c285
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 22 16:02:37 2014 -0500

    Fixed missing tabs from Makefile patch.

commit 32630f9b6f0d5ba28d5b56dae4c7288a37158743
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Sep 19 17:18:20 2014 -0500

    Comment update to virtual micro-kernels.

commit 13447cffead7c6d137a7a3ccbf9e552ed0477467
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Sep 19 13:00:48 2014 -0500

    Minor bugfix to top-level Makefile.
    
    Details:
    - Applied a patch that allows the top-level Makefile to work on certain
      systems. The patch simply separates out the source-to-object code
      generation rules for .c and .S files into two separate rules. Thanks
      to Devin Matthews for submitting this patch.

commit e80a4537846416719c067ae08a53aeda978c572d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Sep 18 10:24:20 2014 -0500

    Fixed bug introduced by bugfix in 25b258d.
    
    Details:
    - We actually need to check alignment of lda*sizeof(double) and NOT
      a+lda because in the latter case, alignment could cancel out and
      still allow the optimized code to run when it shouldn't. Thanks
      to Devin for pointing this out.
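    
    A small sketch of how the alignment can "cancel out". Addresses are
    modeled as integers, and the function names and the 32-byte alignment
    used in the example are assumptions for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static int is_aligned(uintptr_t value, uintptr_t align)
{
    assert(align != 0);
    return value % align == 0;
}

/* Flawed check: tests the address of the second column, a + lda. A
   misaligned base and a misaligned stride can sum to an aligned value. */
static int check_second_column(uintptr_t a, size_t lda, uintptr_t align)
{
    return is_aligned(a + lda * sizeof(double), align);
}

/* Check described in this commit: tests the column stride in bytes. */
static int check_byte_stride(size_t lda, uintptr_t align)
{
    return is_aligned(lda * sizeof(double), align);
}
```

    With a base address of 0x1008, lda = 3, and 32-byte alignment, the
    base is off by 8 bytes and the stride is 24 bytes, so a + lda lands
    on an aligned address even though the stride itself is unaligned.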

commit 25b258d61f9c8cee64e922f4131784b6edb196dd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Sep 18 10:10:49 2014 -0500

    Fixed a non-fatal problem with bugfix in a68b316c.
    
    Details:
    - The bugfix in a68b316c was inadvertently checking alignment of the
      leading dimension itself, rather than the byte size of the leading
      dimension. Now, we simply check alignment of a+lda.

commit 96302d4fc81363410e41c3a3c43a65df44d97ad9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Sep 18 09:43:40 2014 -0500

    Renamed bli_info_get_*_ukr_type() functions.
    
    Details:
    - Added _string() suffix to bli_info_get_*_ukr_type() function names.
      This makes them consistent with the bli_info_get_*_impl_string()
      functions.

commit a68b316ca4852509f84ed50e01afac486bf70f58
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Sep 17 11:10:07 2014 -0500

    Fixed alignment bugs in level-1f kernels.
    
    Details:
    - Fixed bugs whereby the level-1f dotxf, axpyxf, and dotxaxpyf kernels
      were attempting to compute problems with unaligned leading dimensions
      with optimized code, rather than (correctly) using the reference
      implementations. Thanks to Devin Matthews for reporting this bug.

commit 870761eb902e4866090d1d3446a345df3d6d4599
Merge: e9899be0 a2b59a37
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 16 18:20:49 2014 -0500

    Merge branch 'master' of github.com:flame/blis

commit e9899be09044829e23386bd73e394f1dd7778210
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 16 18:19:32 2014 -0500

    Added high-level implementations of 4m, 3m.
    
    Details:
    - Added "4mh" and "3mh" APIs, which implement the 4m and 3m methods at
      high levels, respectively. APIs for trmm and trsm were NOT added due
      to the fact that these approaches are inherently incompatible with
      implementing 4m or 3m at high levels (because the input right-hand
      side matrix is overwritten).
    - Added 4mh, 3mh virtual micro-kernels, and updated the existing 4m and
      3m so that all are stylistically consistent.
    - Added new "rih" packing kernels (both low-level and structure-aware)
      to support both 4mh and 3mh.
    - Defined new pack_t schemas to support real-only, imaginary-only, and
      real+imaginary packing formats.
    - Added various level0 scalar macros to support the rih packm kernels.
    - Minor tweaks to trmm macro-kernels to facilitate 4mh and 3mh.
    - Added the ability to enable/disable 4mh, 3m, and 3mh, and adjusted
      level-3 front-ends to check enabledness of 3mh, 3m, 4mh, and 4m (in
      that order) and execute the first one that is enabled, or the native
      implementation if none are enabled.
    - Added implementation query functions for each level-3 operation so
      that the user can query a string that describes the implementation
      that is currently enabled.
    - Updated test suite to output implementation types for each level-3
      operation, as well as micro-kernel types for each of the five micro-
      kernels.
    - Renamed BLIS_ENABLE_?COMPLEX_VIA_4M macros to _ENABLE_VIRTUAL_?COMPLEX.
    - Fixed an obscure bug when packing Hermitian matrices (regular packing
      type) whereby the diagonal elements of the packed micro-panels could
      get tainted if the source matrix's imaginary diagonal part contained
      garbage.
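    
    The front-end selection order described above (3mh, 3m, 4mh, 4m,
    then native) can be sketched as below. The enum and function are
    hypothetical and only illustrate the priority order:

```c
#include <assert.h>

typedef enum { IMPL_3MH, IMPL_3M, IMPL_4MH, IMPL_4M, IMPL_NATIVE } impl_t;

/* Return the first enabled method in priority order, falling back to
   the native implementation if none are enabled. Each argument is a
   0/1 flag. */
static impl_t select_impl(int en_3mh, int en_3m, int en_4mh, int en_4m)
{
    assert((en_3mh | en_3m | en_4mh | en_4m) >> 1 == 0);
    if (en_3mh) return IMPL_3MH;
    if (en_3m)  return IMPL_3M;
    if (en_4mh) return IMPL_4MH;
    if (en_4m)  return IMPL_4M;
    return IMPL_NATIVE;
}
```

    If both 3m and 4m are enabled, 3m wins because it is checked first.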

commit a2b59a37f166f70a6dd5793db2530823ef590c2b
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Mon Sep 15 10:44:44 2014 -0500

    Fixed make defs so that they actually compile for bulldozer

commit 86fc7e40764f78ec217f50216ef4fa5b57dbfbc7
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Mon Sep 15 10:35:46 2014 -0500

    Added bulldozer configuration and updated piledriver micro-kernel

commit 0644e61a79a57f136be5f4c47b9099cff2af06e0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Sep 11 12:55:34 2014 -0500

    Minor updates to bli_packm_init.c.

commit 9dc9b44a057a08e20ad4d423344f0ecad54c1eb2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Sep 11 12:03:28 2014 -0500

    Renamed bli_obj_pack_status() to _pack_schema().
    
    Details:
    - Renamed the bli_obj_pack_status() macro to bli_obj_pack_schema() in
      order to help avoid confusion as to what the macro returns.

commit cf5efdde0588a0d5b6ea57fe7d7be5000be06f8e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Sep 11 11:47:56 2014 -0500

    Pass pack_t schemas into ukernels via auxinfo_t.
    
    Details:
    - Modified macro-kernels to pass the pack_t schema values for matrices
      A and B into the datatype-specific functions, where they are now
      inserted into a newly-expanded auxinfo_t struct. This gives the
      micro-kernels access to the pack_t schema values embedded in the
      control trees, which determine the precise format into which the
      matrix elements are packed.
    - Updated a call to bli_packm_init_pack() in src/test_libblis.c to
      remove densify argument. Meant to include this in commit c472993b.

commit cc8d2b82775cca3c2d51bf427f4e77c8024a6d15
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 9 13:48:22 2014 -0500

    Updated old test drivers in 'test'.

commit c472993bbccb69e9ffc409c79b742426c8ad2ad4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 9 13:42:04 2014 -0500

    Removed densify argument to packm_cntl_obj_create().
    
    Details:
    - Removed the "densify" bool_t argument to bli_packm_cntl_obj_create().
      This argument was inserted very early in BLIS's development, when it
      was anticipated that the developer may sometimes wish to pack a
      Hermitian, symmetric, or triangular matrix without making it dense.
      But as it turns out, if we are packing a matrix, we always want to
      make it dense in some way or another due to the fact that the micro-
      kernel only multiplies dense micro-panels. Thus, unless/until there
      is a real need for the feature, it seems reasonable to remove it from
      the packm_cntl API.

commit 5c43ee387146cd76dc59b730dac6683a8446b834
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 8 15:19:29 2014 -0500

    Moved trmm4m/3m_cntl files to 'old' directory.
    
    Details:
    - Meant to include this in previous commit.

commit 7b2f469d5465ed73b1ca88124bc9a1987388aa27
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 8 14:49:50 2014 -0500

    Retired trmm_t control tree definitions, usage.
    
    Details:
    - Replaced all trmm_t control tree instances and usage with that of
      gemm_t. This change is similar to the recent retirement of the herk_t
      control tree.
    - Tweaked packm blocked variants so that the triangular code does NOT
      assume that k is a multiple of MR (when A is triangular) or NR (when
      B is triangular). This means that bottom-right micro-panels packed for
      trmm will have different zero-padding when k is not already a multiple
      of the relevant register blocksize. While this creates a seemingly
      arbitrary and unnecessary distinction between trmm and trsm packing,
      it actually allows trmm to be handled with one control tree, instead
      of one for left and one for right side cases. Furthermore, since only
      one tree is required, it can now be handled by the gemm tree, and thus
      the trmm control tree definitions can be disposed of entirely.
    - Tweaked trmm macro-kernels so that they do NOT inflate k up to a
      multiple of MR (when A is triangular) or NR (when B is triangular).
    - Misc. tweaks and cleanups to bli_packm_struc_cxk_4m.c and _3m.c, some
      of which are to facilitate above-mentioned changes whereby k is no
      longer required to be a multiple of register blocksize when packing
      triangular micro-panels.
    - Adjusted trmm3 according to above changes.
    - Retired trmm_t control tree creation/initialization functions.

commit 576e9e9255a79dba9cd3c804267f51e0b4aa6e8a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Sep 7 16:12:52 2014 -0500

    Retired herk_t control tree definitions, usage.
    
    Details:
    - Replaced all herk_t control tree instances and usage with that of
      gemm_t, since the two types presently have the same fields. This means
      that herk, her2k, syrk, and syr2k can simply use the gemm control tree
      as-is, just as hemm and symm have been doing for some time now.
    - Retired herk_t control tree creation/initialization functions.
    - Retired many _target.c and .h files into 'old' directories.

commit b2fed052c9a23d858ef0afbe220b342bce9aa7f7
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Sep 3 17:07:25 2014 -0500

    Minor code cleanup to bli_packm_struc_cxk*.c
    
    Details:
    - Realized that we don't need to track rs_p11 and cs_p11 for
      Hermitian/symmetric case of bli_packm_struc_cxk*(). They are always
      equal to rs_p and cs_p.

commit 023ce770966b3b5a98bba729c5af1f45e15ebb97
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Sep 3 10:47:53 2014 -0500

    Minor update to packm_cxk kernels.
    
    Details:
    - Changed m and n dimension parameter names to panel_dim and panel_len,
      respectively, in packm_cxk, packm_cxk_3m, packm_cxk_4m kernel wrapper
      functions. This makes the code a little easier to read since "m" and
      "n" have connotations that are not applicable here.
    - Comment updates.

commit 189def3667d9218adbeec45e2801fd074341a679
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 1 16:23:17 2014 -0500

    Retired portions of bli_kernel_3m/4m_macro_defs.h.
    
    Details:
    - Removed sections of bli_kernel_[4m|3m]_macro_defs.h that defined
      4m/3m-specific blocksizes after realizing that this can be done in
      bli_gemm[4m|3m]_cntl.c, since that is (mostly) the only place they
      are used.
    - The maximum cache values for 4m/3m are still needed when computing mem
      pool dimensions in bli_mem_pool_macro_defs.h. As a workaround, "local"
      definitions in terms of the regular cache blocksizes are now in place.
    - Similarly, the register blocksizes for 4m/3m are still needed in
      bli_kernel_post_macro_defs.h. As a workaround, "local" definitions in
      terms of the regular register blocksizes are now in place.

commit af521ee6f2a77d61c98b833e85c09969987bc00d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 1 14:06:46 2014 -0500

    Changed semantics of blocksize extensions.
    
    Details:
    - Changed semantics of cache and register blocksize extensions so that
      the extended values are tracked, rather than just the marginal
      extensions.
    - BLIS_EXTEND_[MKN]C_? has been renamed BLIS_MAXIMUM_[MKN]C_?.
    - BLIS_EXTEND_[MKN]R_? has been renamed BLIS_PACKDIM_[MKN]R_?.
    - bli_blksz_ext_*() APIs have been renamed to bli_blksz_max_*(). Note
      that these "max" query routines grab the maximum value for cache
      blocksizes and the packdim value for register blocksizes.
    - bli_info_*() API has been updated accordingly.
    - All configurations have been updated accordingly.
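    
    A minimal sketch of the semantic change, assuming made-up values and
    hypothetical EX_* names (none of these appear in BLIS). The old scheme
    recorded only the marginal extension; the new scheme records the
    extended total directly.

```c
/* Hypothetical values, chosen only for illustration. */
#define EX_DEFAULT_MC_D 256

/* Old semantics: only the marginal extension is recorded. */
#define EX_OLD_EXTEND_MC_D 64

/* New semantics: the full extended (maximum) value is recorded. */
#define EX_NEW_MAXIMUM_MC_D 320

/* Convert an old-style marginal extension into a new-style maximum. */
static int ex_maximum_from_extension(int default_blksz, int extension)
{
    return default_blksz + extension;
}
```

    Under these assumptions, an old configuration with a default of 256
    and an extension of 64 corresponds to a new maximum of 320.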

commit 07f23aefd52f5ba4960dbd46e59b180a2136b8e9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Aug 31 11:58:50 2014 -0500

    Pass pack schema into packm_struc_cxk*().
    
    Details:
    - Changed the interface to the packm_struc_cxk*() kernels to include
      the pack_t schema. This allows the implementation to more easily
      determine how the micro-panel is stored (row-stored column panel
      or column-stored row panel).
    - Updated packm blocked variants to pass in the schema.
    - Updated packm_ker_t function pointer definition accordingly.

commit f032ba9b1186cb02184574d339565f53d733aa42
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Aug 30 16:21:20 2014 -0500

    Reorganized packm implementation.
    
    Details:
    - Reorganized packm variants and structure-aware kernels so that all
      routines for a given pack format (4m, 3m, regular) reside in a single
      file.
    - Renamed _blk_var4 to _blk_var2 and generalized so that it will work
      for both 4m and 3m, and adjusted 4m/3m _cntl_init() functions
      accordingly.
    - Added a new packm_ker_t function pointer type to
      bli_kernel_type_defs.h to facilitate function pointer typecasting in
      the datatype-specific packm_blk_var2() functions.
    - Deprecated _blk_var3.
    - Fixed a bug in the triangular micro-panel packing facility that
      affected trmm and trmm3 with unit diagonals.

commit c6793cecb70788bdf2c76ab8102504ea97be9d2a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 28 17:14:48 2014 -0500

    Reorganized #includes for scalar macro headers.
    
    Details:
    - Reordered the #include statements in bli_scalar_macro_defs.h so that
      conventional, ri-, and ri3-based macros are grouped together.
    - Renamed bli_eqri.h (and macros within) to end with 'ris' suffix.

commit b4da8907284345be4374f87a88679c4886ab866e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 28 14:10:32 2014 -0500

    Whitespace, comments updates on packm_blk_var?.c.

commit 46e46a1d83da586c3dd9fd7a01eb16067abbaee1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 28 12:05:45 2014 -0500

    Minor updates to packm blocked, cxk_3m/4m code.
    
    Details:
    - Added 'const' qualifier to inlined packing code that handles
      micro-panel packing that is too large for an existing packm ukernel.
    - Comment updates.

commit 908dc688b5979995eaacb3aa937f241551a8df00
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 28 11:55:12 2014 -0500

    Pass pack schema into blocked packm routines.
    
    Details:
    - Rather than passing the packm blocked routines a boolean value that
      represents whether the matrix is being packed to row or column storage,
      we now pass in the pack schema itself.

commit a0ff6066e06075ab5f92b19247b39b92ed15f1bf
Merge: c4c99c48 d40b32bc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Aug 24 15:56:21 2014 -0500

    Merge branch 'master' of github.com:flame/blis

commit c4c99c4813bf9817592a7899c5d33412fe22313f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Aug 24 15:52:22 2014 -0500

    Renamed packm scalar from beta to kappa.
    
    Details:
    - The packm implementation (i.e. source files in frame/1m/packm and
      frame/1m/packm/ukernels) interchangeably used the names "beta" and
      "kappa" to refer to the optional scalar to be applied during packing.
      This commit renames all uses of "beta" to be "kappa", since "beta"
      sometimes evokes the scalar specifically on the output matrix of a
      level-2 or level-3 operation.

commit d40b32bc24ffbae24123e054307b3138969bb095
Merge: 9331f794 6c25c379
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Aug 24 13:46:36 2014 -0500

    Merge branch 'master' of github.com:flame/blis

commit 6c25c379fadb50834146e1614f7b80c093c2aad0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Aug 24 13:44:10 2014 -0500

    Consolidated unpackm ukernels into single file.
    
    Details:
    - Reorganized unpackm ukernels into a single file,
      bli_unpackm_ref_cxk.c, in a manner similar to what was done for packm
      ukernels in commit 4cc2b46.

commit 9331f79443223fe267676ee54c439e1ed320380c
Merge: 7fc48a7d 670b6392
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Aug 24 10:54:21 2014 -0500

    Merge branch 'master' of github.com:flame/blis

commit 670b63926a7f4fc694abc5b1582ef8a4f367f5a8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Aug 24 10:46:27 2014 -0500

    Added whitespace to bli_obj_scalar_ routine calls.
    
    Details:
    - Added extra spaces to align arguments of
      bli_obj_scalar_init_detached_copy_of(). This misalignment was due to
      the fact that the function was previously named
      bli_obj_init_scalar_copy_of() and the name change, performed in
      b444489f, was done via recursive sed commands which left subsequent
      lines untouched.

commit 7fc48a7d920e07fd8e9528ab2565123f8f4e67f9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Aug 23 16:50:58 2014 -0500

    Combined 4m/3m bits into an expanded bitfield.
    
    Details:
    - Combined the 4m/3m bits into an expanded bitfield, which will encode
      the packing "format" of the micro-panels. This will allow for more
      easily and compactly encoding additional formats.
    - Other minor comment/whitespace updates to bli_type_defs.h.
    - Updated bli_obj_macro_defs.h and bli_param_macro_defs.h to use the new
      format bitfield.
    - Comment update to bli_kernel_post_macro_defs.h.
    - Whitespace changes to bli_kernel_3m_macro_defs.h, _4m_macro_defs.h.

commit ef0143cc1417e4815e4cafd5a464cc83fe7a1e86
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Aug 23 14:02:27 2014 -0500

    Renamed _ri, _ri3 packm ukernels to _4m, _3m.
    
    Details:
    - Renamed packm ukernels, _cxk dispatcher, and structure-aware _cxk
      helper functions to use _4m and _3m instead of _ri and _ri3 suffixes.
    - Updated names of cpp macros that correspond to packm ukernels.

commit b0ccac116158b5ed3316d34798748ba0c6d78672
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 21 19:21:52 2014 -0500

    Cleaned up front-end layering for 4m/3m.
    
    Details:
    - Added an extra layer to level-3 front-ends (examples: bli_gemm_entry()
      and bli_gemm4m_entry()) to hide the control trees from the code that
      decides whether to execute native or 4m-based implementations. The
      layering was also applied to 3m.
    - Branch to 4m code based on the return value of bli_4m_is_enabled(),
      rather than the cpp macros BLIS_ENABLE_?COMPLEX_VIA_4M. This lays
      the groundwork for users to be able to change at runtime which
      implementation is called by the main front-ends (e.g. bli_gemm()).
    - Retired some experimental gemm code that hadn't been touched in
      months.

commit bedec95451cabfa7a8906b51018a5e0572998a5e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 21 18:25:48 2014 -0500

    Added bli_4m API for querying 4m enabled state.
    
    Details:
    - Added bli_4m.c (and header), which defines a simple API that can be
      used to query, enable, and disable 4m-based complex support in BLIS.
      The macros BLIS_ENABLE_?COMPLEX_VIA_4M are now used to initialize
      the variable that determines the state (enabled or disabled).
    - Changed bli_info*() API so that all cache and register blocksize-
      related query routines return the blksz_t objects' values as they
      exist at runtime, rather than return the values as determined by the
      configuration system (e.g. bli_kernel.h, or defaults for those values
      not specified). This sets the foundation for being able to change
      those blocksizes at runtime.
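    
    A minimal sketch of a query/enable/disable API whose state is
    initialized from a compile-time macro. Only bli_4m_is_enabled() is
    named in this log; every ex_* and EX_* name below is a hypothetical
    stand-in, and the real implementation may differ.

```c
#include <stdbool.h>

/* Stand-in for a configuration macro in the style of
   BLIS_ENABLE_?COMPLEX_VIA_4M; define it to start out enabled. */
#ifdef EX_ENABLE_COMPLEX_VIA_4M
#define EX_4M_INITIAL_STATE true
#else
#define EX_4M_INITIAL_STATE false
#endif

/* Runtime state, initialized from the compile-time macro. */
static bool ex_4m_state = EX_4M_INITIAL_STATE;

static bool ex_4m_is_enabled(void) { return ex_4m_state; }
static void ex_4m_enable(void)     { ex_4m_state = true; }
static void ex_4m_disable(void)    { ex_4m_state = false; }

/* A caller can branch on the runtime state instead of on a cpp macro. */
static const char *ex_gemm_path(void)
{
    return ex_4m_is_enabled() ? "4m" : "native";
}
```

    Because the macro only seeds the initial value, the choice can be
    changed after startup without recompiling.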

commit b541b667cabfa6d41b50ad1e49209651ee6812cc
Merge: 699a8151 dd61307f
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Aug 20 14:44:51 2014 -0500

    Merge branch 'master' of http://github.com/flame/blis
    
    Conflicts:
            frame/3/trsm/bli_trsm_blk_var2b.c
            frame/3/trsm/bli_trsm_blk_var2f.c

commit 699a8151ca3d5021e834a1784ef45dcc3a3d17cd
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Aug 20 14:43:17 2014 -0500

    Some improvements to trsm parallelism

commit dd61307f55bb6bc762fe0ef0446479d6c0536723
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 20 09:52:16 2014 -0500

    Minor update to sandybridge MC_S, KC_S.
    
    Details:
    - Changed sandybridge MC and KC for single-precision real to 128 and 384,
      respectively.
    - Updated comments in template configuration's gemm micro-kernel file
      to document the new "contiguous row preference" macro.

commit d0eec4bddd740ce360d0f655362c551287cf925b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 19 15:49:19 2014 -0500

    Added optional row preference to ukernel config.
    
    Details:
    - Added the ability for the kernel developer to indicate the gemm micro-
      kernel as having a preference for accessing the micro-tile of C via
      contiguous rows (as opposed to contiguous columns). This property may
      be encoded in bli_kernel.h as BLIS_?GEMM_UKERNEL_PREFERS_CONTIG_ROWS,
      which may be defined or left undefined. Leaving it undefined leads to
      the default assumption of column preference.
    - Changed conditionals in frame/3/*/*_front.c that induce transposition
      of the operation so that the transposition is induced only if there
      is disagreement between the storage of C and the preference of the
      micro-kernel. Previously, the only conditional that needed to be met
      was that C was row-stored, which is to say that we assumed the micro-
      kernel preferred column-contiguous access on C.
    - Added a "prefers_contig_rows" property to func_t objects, and updated
      calls to bli_func_obj_create() in _cntl.c files in order to support
      the above changes.
    - Removed the row-storage optimization from bli_trsm_front.c because
      it is actually ineffective. This is because the right-side case of
      trsm flips the A and B micro-panel operands (since BLIS only requires
      left-side gemmtrsm/trsm kernels), meaning any transposition done
      at the high level is then undone at the low level.
    - Tweaked trmm, trmm3 _front.c files to eliminate a possible redundant
      invocation of the bli_obj_swap() macro.
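    
    A minimal sketch of the new transposition test, assuming C is either
    row- or column-stored (general stride is ignored here). The function
    names are hypothetical and are not the code in the _front.c files.

```c
#include <stdbool.h>

/* New rule: transpose the operation only when the storage of C and the
   micro-kernel's preference disagree. */
static bool ex_should_transpose(bool c_is_row_stored, bool ukr_prefers_rows)
{
    return c_is_row_stored != ukr_prefers_rows;
}

/* Old rule, as described above: column preference was assumed, so a
   row-stored C always induced a transposition. */
static bool ex_should_transpose_old(bool c_is_row_stored)
{
    return c_is_row_stored;
}
```

    The two rules agree whenever the micro-kernel prefers columns; they
    differ only for micro-kernels that prefer contiguous rows.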

commit 4cc2b464f29cafbfef9295b073b857fe0752f710
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Aug 15 11:49:15 2014 -0500

    Reorganized packm ukernels.
    
    Details:
    - Previously, packm micro-kernels were organized by the implied register
      blocksize (panel dimension) assumed by the kernel, meaning conventional,
      ri, and ri3 variations of some micro-kernel size were housed in the same
      file. This commit reorganizes the micro-kernels so that all sizes reside
      in the same file for each format type (conventional, ri, and ri3).

commit fcc10054a11b6fc3976986f57feccf741596cbf6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 13 12:32:06 2014 -0500

    Tweaks to gemm4m, gemm3m virtual ukernels.
    
    Details:
    - Fixed a potential, but as-yet unobserved bug in gemm3m that would
      allow undesirable inf/NaN propagation, since C was being scaled by
      beta even if it was equal to zero.
    - In gemm3m micro-kernel, we now avoid copying C to the temporary
      micro-tile if beta is zero.
    - Rearranged computation in gemm4m so that the temporary C micro-tile
      is accessed less, and C is accessed only after the micro-kernel
      calls. This improves performance marginally in most situations.
    - Comment updates to both gemm4m and gemm3m micro-kernels.

commit cdcbacc2fa871317c8e7ef961ecc6d70ab22dc34
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 12 12:45:38 2014 -0500

    Removed redundant redef of packm ukr prototypes.
    
    Details:
    - Removed redundant macro code that redefined packm ukernel prototypes
      when the previous macro was already sufficient. This helps de-clutter
      the packm ukernel prototyping headers a little bit.

commit 82dac98d9032ccb598068a55ddf23d7898491e9e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 12 12:36:25 2014 -0500

    Relocated packm ukernel #includes.
    
    Details:
    - Consolidated the #include statements for packm ukernel headers from
      bli_packm_cxk.h, bli_packm_cxk_ri.h, and bli_packm_cxk_ri3.h to
      bli_packm.h.
    - Comment/whitespace updates to bli_packm_blk_var3.c, _var4.c.

commit 7f77856e25aad5fc6f172ed3e57b6351804e31a4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 12 12:20:15 2014 -0500

    Removed unused 4m/3m-related packm macro defs.
    
    Details:
    - Removed unused and unneeded s- and d-flavored macro definitions for
      packm ukernels related to the complex 4m and 3m methods, as
      implemented in BLIS.

commit bc1d86b2d4d436b1dfba2d0098501aaca9cbb8b5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 7 19:01:20 2014 -0500

    Sandy Bridge configuration, micro-kernel update.
    
    Details:
    - Minor updates to bli_config and bli_kernel.h for sandybridge
      configuration.
    - Renamed existing AVX intrinsic-based micro-kernel file to
      bli_gemm_int_d8x4.c.
    - Added new file, bli_gemm_asm_d8x4.c, which provides assembly-based
      gemm micro-kernels for single- and double-precision real.

commit 98ec95877a95242e159b2bf0c879115a59e4c6e2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 7 18:28:32 2014 -0500

    Corrected comment for _obj_is_[row|col]_stored().
    
    Details:
    - Fixed a mistake in the comments introduced in the previous commit for
      bli_obj_is_row_stored() and bli_obj_is_col_stored().

commit 43d5e419e1b424d2143817103dbee8ead797e8aa
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 7 18:20:40 2014 -0500

    Reverted _obj_is_[row|col]_stored() macros.
    
    Details:
    - Rolled back recent changes to bli_obj_is_row_stored() and
      bli_obj_is_col_stored() so that those macros now only inspect the
      strides (row or column). It turns out that the more sophisticated
      definitions introduced in a51e32e are not necessary, because these
      "obj" macros are virtually never used on packed matrices, and when
      they are, they can use bli_obj_is_[row|col]_packed() macros, which
      inspect the info bitfield.

commit 45692e3ad4b7e1d05ac4302398df4efce04b4284
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 7 13:21:15 2014 -0500

    Reverted some accidental changes.
    
    Details:
    - Reverted some changes that were unintentionally included in the
      previous commit (9526ce98). Thanks to Tony Kelman for pointing
      this out. (Note: a few select changes were not reverted.)

commit 9526ce98812be908bc4915f2849b657fb6ce1b49
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 6 14:13:46 2014 -0500

    Updated copyright headers of emscripten configuration files.

commit 30833ed71d56f231ddba21e632bcbbc90b12a97c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 6 12:12:03 2014 -0500

    Minor edits to configurations' make_defs.mk files.
    
    Details:
    - Redefined CFLAGS, CFLAGS_NOOPT, and CFLAGS_KERNELS so that CFLAGS_NOOPT
      is defined first and then the other two are defined in terms of
      CFLAGS_NOOPT. This textually cleans up the definitions and makes them a
      little easier to read.

commit 9d61afeae2ba70fe1df07e7546f6954ea83aed12
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Aug 4 16:01:59 2014 -0500

    CHANGELOG update (0.1.5)

commit bde56d0ecfd0ec20330fac290b91a6dca0cf94e9 (tag: 0.1.5)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Aug 4 16:01:58 2014 -0500

    Version file update (0.1.5)

commit 4c6ceea4be35d089630986eb5b959b9e97214077
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Aug 4 15:49:59 2014 -0500

    Added CBLAS compatibility layer.
    
    Details:
    - Added a new section in bli_config.h files of all configurations for
      enabling CBLAS support. (Currently, the default is for the CBLAS layer
      to be disabled.)
    - Added a directory, frame/compat/cblas, to house CBLAS source code. A
      subdirectory 'f77_sub' holds subroutine wrappers corresponding to
      subroutines found in CBLAS that allow calling some BLAS routines with
      the return value passed as the last argument rather than as an actual
      (function) return value. This was probably intended to allow CBLAS to
      avoid the whole f2c debacle altogether. However, since BLIS does not
      assume the presence of a Fortran compiler, we had to provide similar
      routines in C.
    - A script, integrate-cblas-tarball.sh, is included to streamline the
      integration of future revisions of the CBLAS source code.
    - The current tarball, cblas.tgz, that was used with the above script to
      generate the present set of CBLAS source code is also included.
    - Updated blis.h to include necessary CBLAS-related headers.

commit caab62dac0fb0bd0d674118f409c81680db94d29
Merge: 383631b5 db97ce97
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Aug 3 14:36:18 2014 -0500

    Merge pull request #19 from kevinoid/fix-install-perms-error
    
    Fix permissions error installing to non-owned directory

commit db97ce979b88c051922c2f946ce52d523c7a12c6
Author: Kevin Locke <kevin@kevinlocke.name>
Date:   Sun Aug 3 12:48:04 2014 -0600

    Fix permissions error installing to non-owned directory
    
    When installing to a directory which is not owned by the installing
    user, even when the user has write permission for the directory, the
    installation can fail with an error similar to the following:
    
    Installing libblis-0.1.4-7-sandybridge.a into /usr/local/lib/
    install: cannot change permissions of ‘/usr/local/lib’: Operation not permitted
    Makefile:658: recipe for target '/usr/local/lib/libblis-0.1.4-7-sandybridge.a' failed
    make: *** [/usr/local/lib/libblis-0.1.4-7-sandybridge.a] Error 1
    
    In the example case, the error occurred because the user attempted to
    install to /usr/local and /usr/local/lib is owned by root with mode 2755
    which the Makefile unsuccessfully attempted to change to 0755.
    
    Given that installing to /usr/local is likely to be quite common and the
    ownership/permissions are the default for Debian and Debian-derived
    Linux distributions (perhaps others as well), this commit attempts to
    support that use case by using mkdir rather than install to create the
    directory (which is the same approach as Automake).
    
    Signed-off-by: Kevin Locke <kevin@kevinlocke.name>

commit 383631b514c3d42b724640f57644eea276cc418c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 31 14:51:48 2014 -0500

    Redefined bit field macros with bitshift operator.
    
    Details:
    - Redefined many of the macros that define bit fields and bit values in
      the obj_t info field using the bitshift operator (<<). This makes it
      easier to reorder bit fields, or expand existing bit fields, or add
      new fields. The bitshifting should be evaluated by the compiler at
      compile-time.
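    
    A minimal sketch of the shift-based style, assuming a hypothetical
    two-field layout (the EX_* names and field widths are illustrative,
    not the actual obj_t layout). Each shift is derived from the previous
    field, so widening or reordering a field means editing one line.

```c
#include <stdint.h>

/* Hypothetical layout: a 3-bit datatype field followed by a 1-bit
   transposition flag. */
#define EX_DATATYPE_SHIFT    0
#define EX_DATATYPE_NUM_BITS 3
#define EX_TRANS_SHIFT       (EX_DATATYPE_SHIFT + EX_DATATYPE_NUM_BITS)

/* Masks are built with << and are constant expressions. */
#define EX_DATATYPE_BITS \
    (((UINT32_C(1) << EX_DATATYPE_NUM_BITS) - 1) << EX_DATATYPE_SHIFT)
#define EX_TRANS_BIT (UINT32_C(1) << EX_TRANS_SHIFT)

/* Extract the datatype field from an info word. */
static uint32_t ex_get_datatype(uint32_t info)
{
    return (info & EX_DATATYPE_BITS) >> EX_DATATYPE_SHIFT;
}
```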

commit 137143345dc93cc9a83da5ba88b25bac7502de86
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 31 12:12:45 2014 -0500

    Reimplemented unit blocksize fix in prev commit.
    
    Details:
    - Instead of inferring the storage format of the micro-panels from within
      the packm variants, we now pass in a bool_t value that denotes whether
      the packed matrix contains row-stored column panels or column-stored
      row panels. This value can then be tested more easily inside the main
      packm variant loop.
    - Renumbered pack_t schema values in bli_type_defs.h so that there are
      now five bits, each with different meaning:
      - 4: packed or not packed?
      - 3: packed for 3m?
      - 2: packed for 4m?
      - 1: packed to panels?
      - 0: stored by rows or columns?
    - Added new macros that test for status of above bits in schema bit
      subfield, and renamed some existing macros related to 4m/3m.
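    
    A minimal sketch of the five schema bits listed above, using
    hypothetical EX_* names. The bit positions follow the list; which
    value of bit 0 means "rows" is not stated in this log and is left
    unspecified here.

```c
/* Bit positions taken from the list above; names are hypothetical. */
#define EX_PACK_BIT       (1u << 4) /* packed or not packed?      */
#define EX_PACK_3M_BIT    (1u << 3) /* packed for 3m?             */
#define EX_PACK_4M_BIT    (1u << 2) /* packed for 4m?             */
#define EX_PACK_PANEL_BIT (1u << 1) /* packed to panels?          */
#define EX_PACK_RC_BIT    (1u << 0) /* stored by rows or columns? */

/* Test macros in the spirit of the ones this commit adds. */
static int ex_is_packed(unsigned s)    { return (s & EX_PACK_BIT) != 0; }
static int ex_is_3m_packed(unsigned s) { return (s & EX_PACK_3M_BIT) != 0; }
static int ex_is_4m_packed(unsigned s) { return (s & EX_PACK_4M_BIT) != 0; }
static int ex_is_panel_packed(unsigned s)
{
    return (s & EX_PACK_PANEL_BIT) != 0;
}
```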

commit a51e32ec061941cd10119ea80115c82a40b1673f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 30 10:41:48 2014 -0500

    Fixed unit register blocksize brokenness.
    
    Details:
    - Fixed a breakdown in BLIS's ability to differentiate between row-stored
      and column-stored micro-panels when MR or NR is unit. When either
      register blocksize (or both) is equal to one, inspecting the strides of
      the affected packed micro-panel is no longer sufficient to determine
      whether the micro-panel is a row-stored column panel or a column-stored
      row panel (because both strides are unit). At that point, dimension
      information is necessary when invoking the bli_is_row_stored_f() and
      bli_is_col_stored_f() macros (and their "obj" counterparts). Thanks to
      Ilya Polkovnichenko for reporting this bug.
    - Added panel dimensions (m and n) to obj_t, which are set in
      packm_init() and then passed into the blocked variants to support the
      aforementioned update.
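    
    A minimal sketch of the ambiguity itself (not of the fix), using
    hypothetical types and helpers. It assumes a column-stored row panel
    has strides (1, MR) and a row-stored column panel has strides
    (NR, 1); with a unit blocksize both collapse to (1, 1).

```c
typedef struct { long rs, cs; } ex_strides_t;

/* Column-stored row panel (MR x k): unit row stride. */
static ex_strides_t ex_row_panel_strides(long mr)
{
    ex_strides_t s = { 1, mr };
    return s;
}

/* Row-stored column panel (k x NR): unit column stride. */
static ex_strides_t ex_col_panel_strides(long nr)
{
    ex_strides_t s = { nr, 1 };
    return s;
}

/* When both strides are unit, strides alone cannot tell the two
   panel types apart. */
static int ex_strides_are_ambiguous(ex_strides_t s)
{
    return s.rs == 1 && s.cs == 1;
}
```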

commit c2732272f0ac680a0ad19fa9db5d587398a1479a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 29 16:37:18 2014 -0500

    Removed old/unused packm variants.

commit b97fa9a5a70fe0123e5eebd999b947461d38445f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Jul 27 18:54:09 2014 -0500

    Minor usage update to build/bump-version.sh.

commit b18ba5f62d98629cdd519ff4c96fc67ec1a62fb9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Jul 27 18:52:05 2014 -0500

    Added missing 'bla_' prefix to r_imag(), d_imag().
    
    Details:
    - Added "bla_" to f2c functions r_imag() and d_imag(). Thanks to Murtaza
      Ali for pointing out the mis-named functions.

commit af7a8e6c042cade452130a6729377f1a3ef4e19e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Jul 27 18:20:13 2014 -0500

    CHANGELOG update (0.1.4)

commit a7537071b152ecff671f8716595d37dc09e4fd51 (tag: 0.1.4)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Jul 27 18:20:12 2014 -0500

    Version file update (0.1.4)

commit acff74041bf02c7b9fdfa24b507bca782a4c5fce
Merge: cdb9413e 47b243ef
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Jul 23 15:07:30 2014 -0500

    Merge branch 'master' of https://github.com/flame/blis

commit cdb9413e140f8a198666250ec88fa34b5425a9c3
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Jul 23 15:05:15 2014 -0500

    Enabled threading for a couple more loops in TRSM
    
    JC loop is now enabled for the left-sided case
    IC loop is now enabled for the right-sided case

commit 47b243ef08f4101de3d936f2373343e67eaa4dd5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 23 13:41:13 2014 -0500

    Call setid for early return from herk/her2k.
    
    Details:
    - Added setid call (to zero imaginary parts of diagonal elements) to
      early return branches of herk_front() and her2k_front() for cases
      where alpha is zero. Thanks to Murtaza Ali for suggesting this fix.
    - Comment update.

commit 3e7b0db5b0e24f5fd66c60bacabc019885ddbec5
Merge: 2f8a357d ed3e33d5
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Jul 23 13:40:44 2014 -0500

    Merge branch 'master' of https://github.com/flame/blis

commit 2f8a357de5fb55163a969d888cf059f24b78125c
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Jul 23 13:40:12 2014 -0500

    Some TRSM threading fixes/additions

commit ed3e33d548047be3283ff41268fdf716563bc542
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 22 14:40:43 2014 -0500

    Tweaked behavior of herk, her2k for BLAS compat.
    
    Details:
    - Updated herk_front() and her2k_front() to explicitly set the imaginary
      components of the diagonal entries of C to zero after the computation
      is complete. This is needed in case downstream applications read the
      full diagonal entries (i.e., including imaginary part), which could, in
      the absence of this modification, accumulate numerical error from
      subsequent rank-k/rank-2k updates.
    - Updated BLAS compatibility wrappers for herk and her2k to return early
      if:
        n == 0 || ( ( alpha == 0 || k == 0 ) && beta == 1 )
      This also results in the imaginary components of diagonal entries NOT
      being set to zero (see above), which is consistent with BLAS.
    - Updated mkherm to use setid instead of an inlined loop over the
      diagonal.
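    
    A minimal sketch of the early-return test quoted above, written as a
    hypothetical predicate with real-valued alpha and beta (as in herk;
    her2k takes a complex alpha, which this sketch does not model).

```c
#include <stdbool.h>

/* True when the compatibility wrapper would return before calling
   into the native implementation. */
static bool ex_herk_returns_early(long n, long k, double alpha, double beta)
{
    return n == 0 || ((alpha == 0.0 || k == 0) && beta == 1.0);
}
```

    Note that alpha == 0 with beta != 1 does not return early, since C
    must still be scaled by beta.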

commit ea59a5c93cde1467a3715abc53dda4aecf961873
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 22 14:36:02 2014 -0500

    Added new level-1d operation: setid.
    
    Details:
    - Defined a new level-1d operation, setid, which sets the imaginary
      components of an object's diagonal elements to a single scalar. This can be
      useful, for example, when trying to make the diagonal of a Hermitian
      matrix real-valued.
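    
    A minimal sketch of what such an operation does, using a
    hypothetical complex struct and function signature rather than the
    object-based BLIS interface.

```c
#include <stddef.h>

/* Hypothetical double-precision complex type. */
typedef struct { double real, imag; } ex_dcomplex;

/* Set the imaginary part of each diagonal element of an m x n matrix
   with row stride rs and column stride cs; real parts are untouched. */
static void ex_setid(size_t m, size_t n, double alpha,
                     ex_dcomplex *a, size_t rs, size_t cs)
{
    size_t n_diag = (m < n) ? m : n;

    for (size_t i = 0; i < n_diag; ++i)
        a[i * rs + i * cs].imag = alpha;
}
```

    Calling it with alpha equal to zero makes the diagonal real-valued,
    which is the Hermitian use case mentioned above.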

commit 8965a965931318619ceaebd7c32edccf3022d0c7
Merge: 1785efb5 5b73e80b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 22 14:34:32 2014 -0500

    Merge branch 'master' of github.com:flame/blis

commit 1785efb5420bc7b9c850a068cb5d99837071e877
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 22 14:33:01 2014 -0500

    Minor improvements to invertd and setd.
    
    Details:
    - Added missing call to invertd_check() from front-end.
    - Changed setd front-end call of scald_check() to setd_check().

commit 5b73e80b71c054c1945a06aff044ef629bc1a9a0
Merge: a41e68e0 20690fe3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 18 12:21:20 2014 -0500

    Merge pull request #16 from Maratyszcza/emscripten
    
    Emscripten port

commit a41e68e09e73b999fab0bb430a43dccfc63aab45
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 17 13:25:56 2014 -0500

    Reimplemented BLIS initialization/finalization.
    
    Details:
    - Rewrote bli_init() and bli_finalize() with OpenMP critical sections
      for thread-safety. Also added lots of explanatory comments.
    - Renamed bli_init_safe() and bli_finalize_safe() with the _auto()
      suffix, and reimplemented for simplicity. Updated all invocations
      in BLAS compatibility layer to use _auto() suffix.
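    
    A minimal sketch of guarding one-time initialization with an OpenMP
    critical section. All ex_* names are hypothetical, and the actual
    bli_init()/bli_finalize() logic may differ. Without OpenMP support
    the pragma is ignored and the code runs serially.

```c
#include <stdbool.h>

static bool ex_is_initialized = false;
static int  ex_init_count = 0; /* how many times real setup ran */

static void ex_init(void)
{
    /* Only one thread at a time may test and update the flag. */
    #pragma omp critical (ex_init_finalize)
    {
        if (!ex_is_initialized) {
            ex_init_count++; /* stands in for the real setup work */
            ex_is_initialized = true;
        }
    }
}

static void ex_finalize(void)
{
    /* Same named section, so init and finalize exclude each other. */
    #pragma omp critical (ex_init_finalize)
    {
        if (ex_is_initialized)
            ex_is_initialized = false;
    }
}
```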

commit 36358948ea75074bda32a9f8c008f835b87d21db
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 17 10:58:10 2014 -0500

    Retired frame/3/gemm/other directory.
    
    Details:
    - Removed frame/3/gemm/other directory, which contained some outdated
      and/or experimental variants.

commit c73261f17edf589e76bdbe297702a1fbbd69275f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 14 16:23:51 2014 -0500

    More minor cleanups post-copyright update.

commit 2a09d24463d358be6243b24f112fad057c2aefe0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 14 16:17:09 2014 -0500

    Reverted power7 symlinks destroyed by sed script.
    
    Details:
    - Reverted two symlinks, in kernels/power7/3/test, back to being symlinks
      after recursive-sed.sh mistakenly replaced them with copies of the
      actual files to which they referred. Meant to include this in previous
      commit.

commit 7ed415824d3b2e78541b6f64e404ca5347c06d3d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 14 16:14:33 2014 -0500

    Updated copyright headers (continued).
    
    Details:
    - Inserted "at Austin" into third clause of license declarations.
      Meant to include this change in previous commit.

commit 5c2c6c85616834ff2716ece083118201d9df6dde
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 14 16:05:03 2014 -0500

    Updated copyright headers to contain "at Austin".
    
    Details:
    - Updated copyright headers to include "at Austin" in the name of the
      University of Texas.
    - Updated the copyright years of a few headers to 2014 (from 2011 and
      2012).

commit fcec68cda3f6e90ae055e7304e6674c1c5c8d010
Merge: 94c0df79 4a20ed1a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 14 11:35:34 2014 -0500

    Merge branch 'master' of github.com:flame/blis

commit 94c0df797eda377931f29a41ba6a89c0ed58daca
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 14 11:24:36 2014 -0500

    Changed order of zero dim / error checking.
    
    Details:
    - Updated level-2 and level-3 internal back-ends so that the operation's
      _check() function is called BEFORE any attempt to return early due to
      the presence of zero dimensions. This ordering makes more sense because
      (for example) object dimensions should match even if one of them is
      zero. Previously, a dimension mismatch could result in an early return
      with no error message.
    - Updated bli_check_object_buffer() so that NULL buffers result in an
      error only if the object is dimensionally non-empty (i.e., only if both
      of the object's dimensions are non-zero). This allows BLIS operations
      to be performed on dimensionally empty objects (i.e., where at least one
      dimension is zero).
    - Updated the error message associated with bli_check_object_buffer()
      to mention the newly relaxed constraint mentioned above, vis-a-vis
      non-zero dimensions.
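    
    A minimal sketch of the new ordering, using a hypothetical object
    type and a two-operand operation. Checks run first; the zero
    dimension early return comes only afterwards, and a NULL buffer is
    an error only for a dimensionally non-empty object.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct { long m, n; const void *buffer; } ex_obj_t;
typedef enum { EX_SUCCESS, EX_DIM_MISMATCH, EX_NULL_BUFFER } ex_err_t;

static bool ex_has_zero_dim(const ex_obj_t *o)
{
    return o->m == 0 || o->n == 0;
}

/* Relaxed buffer check: NULL is fine if the object is empty. */
static bool ex_buffer_is_ok(const ex_obj_t *o)
{
    return o->buffer != NULL || ex_has_zero_dim(o);
}

static ex_err_t ex_op(const ex_obj_t *x, const ex_obj_t *y)
{
    /* 1. Error checking, even if a dimension is zero. */
    if (x->m != y->m || x->n != y->n) return EX_DIM_MISMATCH;
    if (!ex_buffer_is_ok(x) || !ex_buffer_is_ok(y)) return EX_NULL_BUFFER;

    /* 2. Only then return early for empty operands. */
    if (ex_has_zero_dim(x)) return EX_SUCCESS;

    /* 3. The actual computation would go here. */
    return EX_SUCCESS;
}
```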

commit 20690fe3018ce17c8df61ce0bffecaa7911dc3a5
Author: Marat Dukhan <maratek@gmail.com>
Date:   Sun Jul 13 22:50:56 2014 -0700

    Emscripten port

commit 4a20ed1a3f5e9e5232df30aa0e568e6c00c56ce1
Merge: 6a515e98 8ccdfaef
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Jul 13 17:45:01 2014 -0500

    Merge pull request #14 from Maratyszcza/master
    
    Support "make test" for PNaCl configuration

commit 6a515e988f2ae1628258a6dec2c0e9cf2d04790f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Jul 13 17:38:33 2014 -0500

    Implemented dsdot() and sdsdot() in compat layer.
    
    Details:
    - Replaced "not yet implemented" error messages in dsdot() and sdsdot()
      with actual implementations. (These routines are so rarely used that
      this log message will probably lead to some people learning of their
      existence for the first time.)

commit 255668ddd1004552c6cc65035ec6486671ce99bb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Jul 13 17:30:44 2014 -0500

    Inserted gemv beta-scaling bug into compat layer.
    
    Details:
    - BLAS has a peculiar bug (or feature) whereby calling gemv on a vector
      y of non-zero length and a vector x of zero length results in no action.
      Given that the operation is y := beta*y + A*x, many (most?) individuals
      would expect vector y to still be scaled by beta. BLIS, when called
      natively, handles these cases intuitively (with beta scaling).
      Unfortunately, many BLAS test suites actually check for the way this
      situation is handled. Therefore, we have decided to implement this "bug"
      in the compatibility layer so as to provide "bug-for-bug" compatibility
      with BLAS.
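    
    A minimal sketch of the two behaviors contrasted above, assuming a
    column-major matrix, unit strides, and no transposition. The function
    names are hypothetical and the code is far simpler than either BLIS or
    the reference BLAS; it only illustrates the early return.

```c
#include <stddef.h>

/* "Native" behavior: y := beta*y + alpha*A*x, where A is m x n.
   When n == 0 the inner loop does nothing, so y is still scaled by beta. */
void gemv_native(size_t m, size_t n, double alpha,
                 const double *a, size_t lda,
                 const double *x, double beta, double *y)
{
    for (size_t i = 0; i < m; ++i) {
        double acc = 0.0;
        for (size_t j = 0; j < n; ++j)
            acc += a[i + j * lda] * x[j];
        y[i] = beta * y[i] + alpha * acc;
    }
}

/* BLAS-compatible behavior: return immediately if either dimension is
   zero, leaving y untouched even when beta != 1. */
void gemv_blas_compat(size_t m, size_t n, double alpha,
                      const double *a, size_t lda,
                      const double *x, double beta, double *y)
{
    if (m == 0 || n == 0)
        return;

    gemv_native(m, n, alpha, a, lda, x, beta, y);
}
```

    With m = 2, n = 0, and beta = 3, the native version triples y while
    the compatibility version leaves it unchanged.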

commit 570a154581bdb353fa13a219c7cb3c81d3dceffd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jul 12 17:51:05 2014 -0500

    Comment/formatting updates to build scripts.
    
    Details:
    - Minor updates to comments and formatting in bump-version.sh and
      update-version-file.sh scripts.

commit 26cd81990631ff799791629206e068126ff9e3a1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 10 13:16:07 2014 -0500

    Added bli_info_*() query functions.
    
    Details:
    - Added a new API family, bli_info_*(), which can be used to query
      information about how BLIS was configured. Most of these values are
      returned as gint_t, with the exception of the version string which
      is char*.
    - Changed how the testsuite driver queries information about how BLIS
      was configured (from using macro constants directly to using the
      new bli_info API).
    - Removed bli_version.c and its header file.
    - Added STRINGIFY_INT() macro to bli_macro_defs.h
    - Renamed info_t type in bli_type_defs.h to objbits_t (not because of
      an actual naming conflict, but because the name 'info_t' would now be
      somewhat misleading in the presence of the new bli_info API, as the
      two are unrelated).

commit 970b43141697d8c31a033f59513bb59d7cc78ab0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 10 09:30:00 2014 -0500

    Minor bugfixes to BLAS compatibility layer.
    
    Details:
    - Changed bla_amax.c so that i?amax() routines now correctly return 0
      if ( n < 1 || incx <= 0 ).
    - Changed bla_rotg.c and bla_rotmg.c to use bli_fabs() macro instead of
      f2c's abs() macro for float and double cases.
    - Thanks to Murtaza Ali for suggesting the two fixes above.
    - Updated label of fnormv to normfv in testsuite/input.operations.
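    
    A minimal sketch of the i?amax() early-return rule from the first item
    above, assuming the usual BLAS convention of a 1-based result. The
    name compat_isamax() is hypothetical and this is not the code in
    bla_amax.c.

```c
/* Absolute value without pulling in math.h. */
static float toy_absf(float v)
{
    return v < 0.0f ? -v : v;
}

/* Returns the 1-based index of the first element with the largest
   absolute value, or 0 if n < 1 or incx <= 0. */
int compat_isamax(int n, const float *x, int incx)
{
    if (n < 1 || incx <= 0)
        return 0;

    int   best = 1;
    float max  = toy_absf(x[0]);

    for (int i = 1; i < n; ++i) {
        float v = toy_absf(x[i * incx]);
        if (v > max) {      /* strict comparison keeps the first maximum */
            max  = v;
            best = i + 1;
        }
    }
    return best;
}
```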

commit 8ccdfaef4c42ad8957af8607a1a9ee29b9277d4b
Author: Marat Dukhan <maratek@gmail.com>
Date:   Tue Jul 8 23:14:36 2014 -0700

    Replicated logic from testsuite/Makefile in top-level Makefile to support make test

commit caa6507ff3724c80d60987f309b8bbc5b50a9841
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 8 10:25:27 2014 -0500

    Minor cleanup to standalone test drivers.
    
    Details:
    - Very minor code changes to standalone test drivers in 'test' directory.
    - Added *.so files to '.gitignore'.

commit 6c65e9a58fe55990ebb99ec3986443e18af35338
Merge: cb12e456 daca500d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 8 10:13:49 2014 -0500

    Merge branch 'master' of github.com:flame/blis

commit cb12e456f94c196c093e52f02a7cbca0032fc86e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 8 10:07:46 2014 -0500

    Fixed possible level-3 inf/NaN issue when beta=0.
    
    Details:
    - Redefined xpbys_mxn and xpbys_mxn_u/_l macros to employ a copy
      (instead of scaling by beta) when beta is zero. This will stamp out
      any possible infs or NaNs in the output matrix, if it happens to be
      uninitialized. Thanks to Tony Kelman for isolating this bug.
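    
    A minimal sketch of why the copy matters, assuming IEEE 754 arithmetic
    (where 0 * NaN is NaN). The function names are hypothetical; the real
    xpbys_mxn macros operate on matrices and all four datatypes.

```c
#include <stddef.h>

/* Naive update: y := x + beta*y. If y holds NaN or inf, multiplying by
   beta == 0 does not clear it. */
void xpbys_naive(size_t n, const double *x, double beta, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = x[i] + beta * y[i];
}

/* Safe update: when beta == 0, overwrite y with x and never read y. */
void xpbys_safe(size_t n, const double *x, double beta, double *y)
{
    if (beta == 0.0) {
        for (size_t i = 0; i < n; ++i)
            y[i] = x[i];
        return;
    }

    for (size_t i = 0; i < n; ++i)
        y[i] = x[i] + beta * y[i];
}
```

    If y initially contains NaN and beta is zero, the naive version leaves
    NaN in y, while the safe version stores x.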

commit daca500db5e2448ba0da8047b75eb0f88d9f40e3
Merge: ab3bc915 47023502
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Jul 3 12:52:52 2014 -0500

    Merge branch 'master' of http://github.com/flame/blis

commit 4702350278af31f662b458127777dd4d85a3192f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 3 11:48:23 2014 -0500

    Defined _ukernel_void() wrappers to micro-kernels.
    
    Details:
    - Added wrappers for micro-kernels so that users may invoke the
      micro-kernels without knowing what the function names actually are.
      This is useful when an application wishes to call the micro-kernel
      from a shared library instance of BLIS, where the application may not
      necessarily have the luxury of grabbing the micro-kernel name(s) from
      C preprocessor macros at compile-time. Also, since the wrappers use
      void* pointers, one's environment does not need to be aware of some
      BLIS types such as scomplex and dcomplex. These wrappers now join the
      level-1 and level-1f kernel wrappers, which pre-dated this commit.
    - Removed the wrapper definitions and prototypes from the micro-kernel
      test suite modules, and replaced calls to them with calls to the new
      wrappers mentioned above.

commit ab3bc9153b914fbaf259e15b66c91d628e7c8661
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Jul 3 11:19:43 2014 -0500

    Fixed a bug for TRSM when BLIS_ENABLE_MULTITHREADING is not set but the multithreading environment variables are turned on

commit b8134b720b985783ee6a582a3eb5d6c51f00d051
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Jul 2 16:02:39 2014 -0500

    Quick and dirty multithreading for TRSM
    
    Should work fine for a small number of threads (up to 8 or maybe even 16).
    However, performance is as yet untested.
    This parallelizes the "JR" loop for the left sided cases
    and the "IR" loop for the right sided cases.
    
    Future work is to parallelize the outer loops as well.

commit e8ef69692831db07ddbe9485a5e504ac3f03e496
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 2 14:59:27 2014 -0500

    Added shared library support to build system.
    
    Details:
    - Modified top-level Makefile to support building shared (dynamic)
      libraries.
    - Updated most configurations' make_defs.mk files to include the
      compiler/linker flags needed by the top-level Makefile.
    - Note that by default, all configurations presently do NOT build
      shared libraries. To enable, one must change the value of
      BLIS_ENABLE_DYNAMIC_BUILD to 'yes'.

commit b80df0f2cffb015da02e70a82b8512da9891ab67
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 23 13:52:39 2014 -0500

    Added bump-version.sh script to 'build' directory.
    
    Details:
    - Added a bash script, bump-version.sh, to aid in incrementing the BLIS
      version string.

commit 9ef1f1e21d083697fc730e48d7d9169c201f3da2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 23 13:48:17 2014 -0500

    CHANGELOG update (0.1.3)

commit 036cc634918463b1caa0fd89c9a211f2f5639af7 (tag: 0.1.3)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 23 13:48:17 2014 -0500

    Version file update (0.1.3)

commit 09d9a3bf6763932d9f571085b2cfd1b8631eccba
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 23 13:43:26 2014 -0500

    Reverting version file to test new version script.
    
    Details:
    - Changed version file contents to 0.1.2 so that I can test out a new
      version file bumping script.

commit ebb33965981dcb2b0bdee5fc7fdf6c959420f311
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 23 11:22:50 2014 -0500

    Added 'version' file.

commit 2cb9a5501a3cbeb6692cf68e896087ba73b6af69
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 23 10:42:29 2014 -0500

    Removed 'version' from .gitignore file.

commit b40dcefc5ee31f67aa3990e2e9d2ef8ed1386a25
Merge: 7101a8ee b693b0cd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 23 10:39:05 2014 -0500

    Merge pull request #11 from Maratyszcza/stable
    
    [sc]axpy kernels for PNaCl

commit b693b0cddcfb41450e3c09a3ab97acb44c1ccdec
Author: Marat Dukhan <maratek@gmail.com>
Date:   Sun Jun 22 13:44:25 2014 -0700

    [SC]AXPY kernels for PNaCl

commit 7101a8eec0327d6c3a7eb36eb4b0fd45c1c6d162
Merge: ad48dca2 020a831b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 19 21:46:50 2014 -0500

    Merge pull request #10 from Maratyszcza/stable
    
    Portable Native Client port

commit 020a831bc5f61744cb8354886aa679b99b1285f6
Author: Marat Dukhan <maratek@gmail.com>
Date:   Thu Jun 19 00:58:26 2014 -0700

    Code clean-up in PNaCl port

commit 491be4f91ed725522f5cc7184053857c6c376ada
Author: Marat Dukhan <maratek@gmail.com>
Date:   Thu Jun 19 00:45:44 2014 -0700

    Optimized dot product kernels for PNaCl

commit 4b8e71aab80182873a2e138eb07902b8d8fd5480
Author: Marat Dukhan <maratek@gmail.com>
Date:   Thu Jun 19 00:43:25 2014 -0700

    Use AR rcs flags for PNaCl target to avoid warning

commit 031deb2a5c718d569bde842590a791b812f4cf1d
Author: Marat Dukhan <maratek@gmail.com>
Date:   Wed Jun 18 03:11:34 2014 -0700

    PNaCl configuration: use pnacl-ar instead of ar (fixes build issue on Mac)

commit 68a02976e3c3638f0a9821342e269a1743e3ace3
Author: Marat Dukhan <maratek@gmail.com>
Date:   Wed Jun 18 03:10:25 2014 -0700

    Compile pnacl configuration in GNU11 mode to avoid warning about non-standard features

commit 6f8462eb0ec278b89731e73ef583386a3371d095
Author: Marat Dukhan <maratek@gmail.com>
Date:   Wed Jun 18 03:08:46 2014 -0700

    Fix inconsistent VERBOSE macro in Makefile

commit b2ffb4de8b6872cb23537ad282e557d11dcd9c8b
Author: Marat Dukhan <maratek@gmail.com>
Date:   Sun Jun 15 18:41:30 2014 -0400

    Reformatted PNaCl GEMM kernels

commit 6de2d472d98baa215264a776f3d5291780a6a085
Author: Marat Dukhan <maratek@gmail.com>
Date:   Sun Jun 15 08:44:31 2014 -0400

    CGEMM and ZGEMM kernels for PNaCl

commit f064711a5e6fb3852c17c7520909b09dc27665f2
Author: Marat Dukhan <maratek@gmail.com>
Date:   Sun Jun 15 06:27:37 2014 -0400

    SGEMM and DGEMM kernels for PNaCl

commit ad48dca22913a363899f0bef45553898718eebb1
Merge: ee2b6792 7118f87e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jun 14 15:10:13 2014 -0500

    Merge pull request #9 from tkelman/memalign_windows
    
    Use _aligned_malloc instead of posix_memalign on Windows

commit 7118f87e18b4941423472afc00215c1d1f2a1fcd
Author: Tony Kelman <tony@kelman.net>
Date:   Sat Jun 14 06:53:20 2014 -0700

    Use _aligned_malloc instead of posix_memalign on Windows
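    
    A minimal sketch of the platform switch named in the subject line. The
    wrapper names toy_aligned_alloc() and toy_aligned_free() are
    hypothetical; this is not the code from the commit. Note that the two
    system calls take their arguments in opposite order, and that memory
    from _aligned_malloc() must be released with _aligned_free().

```c
#if !defined(_WIN32) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 200112L  /* exposes posix_memalign() */
#endif
#include <stdlib.h>
#ifdef _WIN32
#include <malloc.h>              /* _aligned_malloc(), _aligned_free() */
#endif

/* alignment must be a power of two and a multiple of sizeof(void *). */
void *toy_aligned_alloc(size_t alignment, size_t size)
{
#ifdef _WIN32
    return _aligned_malloc(size, alignment);
#else
    void *p = NULL;
    if (posix_memalign(&p, alignment, size) != 0)
        return NULL;
    return p;
#endif
}

void toy_aligned_free(void *p)
{
#ifdef _WIN32
    _aligned_free(p);
#else
    free(p);
#endif
}
```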

commit ee2b679281ca45fb40b2198e293bc3bc3d446632
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Jun 6 12:41:55 2014 -0500

    Only include omp.h if BLIS_ENABLE_OPENMP is set

commit 19c05dfaac43c627f86e897c8c00f1f9440754aa
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 5 10:54:16 2014 -0500

    CHANGELOG update (for 0.1.2).

commit 00f232f8ed1f7c41619b12ebf779ebe2c3b2d3cd (tag: 0.1.2)
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Mon Jun 2 13:40:57 2014 -0500

    Added single-precision micro-kernel for Knights Corner aka MIC aka Xeon Phi

commit 3fc60e491426f6248c0feae88d971e4d1f88fb95
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 21 11:34:42 2014 -0500

    Fixed ldim alignment bug in core2 gemm ukernel.
    
    Details:
    - Fixed a bug in the dunnington/core2 gemm micro-kernels that resulted in
      a segmentation fault if a column-stored matrix's starting address was
      aligned, but its leading dimension was such that its second column was
      unaligned. Basically, the micro-kernel was assuming that aligned load
      instructions were safe when they actually were not. An extra condition
      that checks the alignment of cs_c (i.e., the leading dimension in the
      column storage case) has now been added. Thanks to Michael Lehn for
      reporting this bug.
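    
    A minimal sketch of the kind of check described above, assuming a
    column-stored matrix of doubles. The function name and the align_bytes
    parameter are hypothetical; the actual micro-kernels perform the
    equivalent test inline.

```c
#include <stddef.h>
#include <stdint.h>

/* Aligned loads are safe for every column only if the starting address
   AND the column stride (in bytes) are both multiples of the alignment.
   Checking the starting address alone misses the case where the second
   column begins at an unaligned address. */
int can_use_aligned_loads(const double *c, size_t cs_c, size_t align_bytes)
{
    int base_aligned = ((uintptr_t)c % align_bytes) == 0;
    int ldim_aligned = ((cs_c * sizeof(double)) % align_bytes) == 0;

    return base_aligned && ldim_aligned;
}
```

    For 16-byte alignment, an aligned matrix with cs_c = 4 qualifies, but
    the same matrix with cs_c = 3 does not, since its second column starts
    24 bytes past the first.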

commit 77a2d8dac8b242d7a202c9aabda3927ab68cf987
Merge: 8c5d6071 21fb0893
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 20 09:53:19 2014 -0500

    Merge pull request #8 from tlrmchlsmth/master
    
    Added multithreading to most level-3 operations.

commit 21fb089387ee7c87f6dc53b0f60f68b48d3ff3e8
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Mon May 19 20:38:55 2014 -0700

    Reverting changes to dunnington and reference configs
    
    Now they are unchanged from the main branch of BLIS

commit 8a0ef0e0db5880730425926f8ba56b457a2ba764
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri May 16 13:44:14 2014 -0500

    Fixed rounding error in bli_get_range_weighted

commit 0b4b1680334528b1b60bc696537600f763198e92
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri May 16 12:23:37 2014 -0500

    Fixed bug with disabling JC loop threading for right sided trmm

commit 5c048a90d8dfa1dbde4e45fbc10ffcbdfe59d960
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed May 14 16:20:06 2014 -0500

    Disabled parallelism for right-sided TRMM JC loop
    
    The loop has dependent iterations.

commit 13a4c717ed0e273359dbaf5554cc4fa70b087d71
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed May 14 14:59:04 2014 -0500

    Fixed bug with bli_get_range_weighted

commit 45957cc7745e9bb1698408d72f53ef192e960820
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Tue May 13 17:14:46 2014 -0500

    Allowed threading to be turned off
    
    No longer requires OpenMP to compile
    Define the following in bli_config.h in order to enable multithreading:
    BLIS_ENABLE_MULTITHREADING
    BLIS_ENABLE_OPENMP
    
    Also fixes a bug with bli_get_range_weighted

commit bd1dc98ce599d74513a553fe3b37a2ebca1c3812
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Mon May 12 17:26:19 2014 -0500

    Disabled multithreading of the kc loop

commit 456df0372170bd7ca2c7e2d85365a69f1f04de88
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Apr 30 12:28:00 2014 -0500

    Replaced register blocksize hack with querying the register blocksize for determining parallelism granularity

commit f4fdfe8fc573553eb36795b79cdf681270dab71b
Merge: 31bb065b 8c5d6071
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Apr 30 11:46:35 2014 -0500

    Merge http://github.com/flame/blis

commit 8c5d6071e24ba10a53669390a47287e86ff354ce
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 29 12:26:12 2014 -0500

    Added _check() routines for fprint[mv], rand[mv].
    
    Details:
    - Added _check() routines for fprintm, fprintv, randm, and randv.
    - Added invocations to the above routines from their respective
      front-ends.

commit 262cdabcc885bcf6636f4d8bb7d320f95e81d820
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 28 16:48:25 2014 -0500

    Changed treatment of NULL object buffers.
    
    Details:
    - Relaxed the constraint in bli_obj_attach_buffer_check(), which required
      the buffer address being attached to be non-NULL. This is acceptable
      because the user was already able to create and use objects with NULL
      buffers (via bli_obj_create_without_buffer(), which initializes the
      buffer to NULL).
    - Inserted calls to newly defined function, bli_check_object_buffer(),
      into nearly all operations' _check() or _int_check() functions. This
      allows BLIS to abort peacefully if a computational routine is called
      with an object containing a NULL buffer. By contrast, under such
      conditions, BLAS would typically fail with a segmentation fault.
    - Within operation front-ends, moved the calls to _check()/_int_check()
      so that zero dimensions are checked first (and if found, execution
      returns with trivial or no computation). This resolves issue #7. Thanks
      to Jack Poulson for reporting this bug.

commit 31bb065ba40ae0c5a614e743b8025abca012b99e
Merge: 20e24430 7c619599
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Apr 23 12:30:19 2014 -0500

    Merge http://github.com/flame/blis

commit 7c61959955c8ba78160d0ed4d1979022029d963b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 10 17:18:36 2014 -0500

    Can now query register blocksizes from blk algs.
    
    Details:
    - Added a new field to blksz_t objects that allows one to attach a
      sub-object. Doing this allows us to associate a register blocksize with
      any given cache blocksize. That way, the register blocksize can be
      queried wherever the cache blocksize would normally be accessible
      (e.g. a blocked algorithm).
    - Modified bli_gemm_cntl.c (and 4m/3m variants) so that the register
      blocksizes are attached to the cache blocksizes after they are created.

commit 58671597d3d450817b2eda576c05ed6dadd8af6d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 10 15:35:30 2014 -0500

    Minor cleanups to level-2 _cntl.c files.
    
    Details:
    - Changed level-2 _cntl.c files so that the blocksizes for gemv are
      imported and used, rather than blocksizes being declared locally.
    - Whitespace changes to gemv_cntl.c and gemm_cntl.c files (as well as
      4m/3m variants).
    - Removed test/old/test_blis2.c.

commit 20e24430a772bc0fbaf24dec2f8c544096fd3f4e
Author: Tyler Michael Smith <tmsmith@vestalac1.ftd.alcf.anl.gov>
Date:   Tue Apr 8 17:50:44 2014 +0000

    Some fixes for the bgq kernels

commit bde697f75ec1e7f2decebee0c9bd620b4c134cd5
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Apr 4 16:43:44 2014 -0500

    Add -openmp to ldflags as well

commit c332be8cd471eeace7b4fa4ae7443088b6a68ec3
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Apr 4 16:37:50 2014 -0500

    Added -openmp flag to Xeon Phi build for convenience

commit e7ca9e4b4a24d585c9aec8293fc7bb79e4171ad0
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Apr 4 16:31:15 2014 -0500

    Used BLIS_DEFAULT_*_MR for rounding partitioning instead of BLIS_DEFAULT_*_MC

commit 7b9b228c6fa4cfb70b1ebb855b009a036e85fac3
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Apr 4 16:29:10 2014 -0500

    Fix for tree barrier freeing bug

commit 5ec93bd9a76096312d51c326ccde1e9bd0a436ab
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Apr 4 15:09:10 2014 -0500

    Bunch of minor fixes
    
    Removed barrier after unpackm in all level3 blocked variants
    Now there is an implicit barrier inside unpackm that only occurs if C is packed (which is usually not the case)
    
    Moved the enabling of the tree barriers into bli_config.h
    Fed the default MR and NR for double precision into bli_get_range instead of the number 8

commit 575fb9b0b08f3bdb56ccde056da619d1585617c1
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Apr 4 12:13:29 2014 -0500

    Changed default blocking factor to default double precision MR and NR

commit ab9c7880335c281432d5809fe0dec46753d22569
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Apr 4 11:38:11 2014 -0500

    Added faster tree barriers necessary for performance for Xeon Phi
    
    Fixed up some stuff in the thread info free functions
    Disabled threading for TRSM so that it actually works when threading environment variables are set

commit ec58a7923cccac08632670caadf3cf6ff5dce766
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Apr 4 10:22:48 2014 -0500

    Freeing thread info paths.
    
    Also made herk IC and JC loops do weighted partitioning

commit 2b6848b2397d6d84ca4e5f792fc51ad05e351a36
Merge: 4e3eb39a 21a0efb3
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Apr 4 09:54:54 2014 -0500

    Merge http://github.com/flame/blis
    
    Conflicts:
            kernels/bgq/1/bli_axpyv_opt_var1.c
            kernels/bgq/1/bli_dotv_opt_var1.c

commit 4e3eb39aca4df0b9fdc003d468f368a2f2ba597d
Author: Tyler Michael Smith <tmsmith@vestalac1.ftd.alcf.anl.gov>
Date:   Fri Apr 4 14:50:03 2014 +0000

    Some fixes to the bgq config
    MR and NR for double complex were wrong
    Default fusing factor for double precision was wrong as well

commit 21a0efb33d7435139e9c43c1a4787a6bff533e26
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 3 16:38:44 2014 -0500

    Fixed follow-up to issue #6.

commit c318157a9bee8ea6e59be16f99f65d9271fe0d27
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 3 16:24:34 2014 -0500

    Fixed issue #6 (incorrect 'restrict' usage).
    
    Details:
    - Fixed improper usage of restrict keyword in axpyv and dotv bgq kernels.
      (However, there may be other instances of similar misuse elsewhere in
      BLIS.) Thanks to Jeff Hammond for reporting this issue.

commit b5150a1bf3bd89598e2b3aeac110eb5b44ac6c12
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 3 12:25:45 2014 -0500

    Added #include "arm_neon.h" to ARM gemm ukernel.
    
    Details:
    - Inserted #include "arm_neon.h" into gemm ukernel source file for
      arm/neon. Thanks to Jean-Michel Hautbois for suggesting this fix.

commit 2041c264517b6c590fd4f7e8253e6911b622d1c3
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Apr 3 10:30:03 2014 -0500

    Added barriers needed prior to doing scalar reset for rank-k updates.

commit 47a90e69dfde3f4f8fdf90654248a6b499fbadbc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 1 14:34:31 2014 -0500

    Attempted to fix uninitialized variable warnings.
    
    Details:
    - Added initialization statements to various macros used in level 1m and
      1m-like operations. I wasn't able to reproduce the reported behavior,
      so hopefully this takes care of it. Thanks to Jeff Hammond for the
      report.

commit d27b4f690c14b1f836f8c7a3c0e91e09d852f02e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 1 12:57:24 2014 -0500

    Use generic paths for toolchain in POWER7.
    
    Details:
    - Fixed issue #4. Thanks to Jeff Hammond for contributing changes.

commit 1584ae1c83c3a8c1af76acb46404747507650f19
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Mar 28 15:15:48 2014 -0500

    Fixed race condition involving scalar reset

commit 459dde4acc09e49380da58fb7b246db488884ad9
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Mar 27 17:06:45 2014 -0500

    Made barrier after packing implicit.
    
    This also fixed a bug where barriers in the blocked variants were inserted after the inner packing routines,
    but not after the outer packing routines.
    This allowed, for instance, computation to begin before the block of B had finished being packed.

commit 9f78ec6e7e95fcad89a167b27cad7e2d74b6d122
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Mar 27 14:18:46 2014 -0500

    Some fixes for the internal functions,
    which were inappropriately having only the thread chief do some things.

commit a6fd48345424e097f71652be013aa897e098b41e
Author: Tyler Michael Smith <tmsmith@vestalac1.ftd.alcf.anl.gov>
Date:   Wed Mar 26 17:19:46 2014 +0000

    Added test drivers for level 3 BLAS that run tests in parallel using MPI

commit 73b3db594864be0f9be9a0eb29bf961fa9c95f29
Author: Tyler Michael Smith <tmsmith@vestalac1.ftd.alcf.anl.gov>
Date:   Wed Mar 26 15:39:05 2014 +0000

    Some fixes for the bgq configuration

commit f0824a04fc75e231c3a3d7757fa4e7294173282f
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Mon Mar 24 15:21:42 2014 -0500

    Initial commit to enable threading in TRSM
    
    Also enabled weighted partitioning for herk, trmm
    Fixed bug where multiple threads would try to modify the same state in the internal level 3 functions
    Correctly computed a_next and b_next for gemm, herk macrokernels
    a_next and b_next point to the current micropanels in trmm

commit 23d9eab354fbc88165889832955e126772bf8488
Merge: 5d5dc2ee fd3e32a5
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Mar 20 16:54:35 2014 -0500

    Merge https://github.com/flame/blis

commit 5d5dc2eedef2f7c90d61371a1b457be5c06cf583
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Mar 20 16:43:36 2014 -0500

    Parallelized trmm and trmm3
    
    Also fixed bugs in packm

commit fd3e32a5f419fa412f46afe4dd1c3a26e15f3eb4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Mar 20 13:59:48 2014 -0500

    Refined INSERT_GENTFUNC macro usage.
    
    Details:
    - Defined new INSERT_GENTFUNC macros so that the macro always takes
      exactly the number of arguments needed for the particular operation or
      variant being defined. Many operations were using INSERT_GENTFUNC
      macros that expected one auxiliary argument even though none were
      needed. Those instances have now been updated. Most of these instances
      were in the level-0 and -1v operations, as well as some operations
      defined in frame/util.

commit 9b0e715f29338a1a1d6445907d2445c35f011121
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Mar 19 15:47:54 2014 -0500

    Minor simplifications to trmm, trsm macro-kernels.
    
    Details:
    - Simplified some code that would have allowed the diagonal of a trmm
      or trsm triangular matrix to intersect the short end of a micro-panel.
      This is disallowed via higher-level constraints on cache blocksizes, so
      this code was never needed and only served to obfuscate.
    - Updated some comments in trmm, trsm macro-kernels.

commit a3902750b9ab4923433f7e353f3669c3c419f8e4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Mar 19 12:35:17 2014 -0500

    Reorganized norm operations.
    
    Details:
    - Completely reorganized norm operations:
      - Renames:
        - fnormsc, fnormv, fnormm -> normfsc, normfv, normfm (2-norm)
        - absumv -> norm1v (vector 1-norm)
      - New operations:
        - norm1m (matrix 1-norm)
        - normiv, normim (infinity-norm)
        - amaxv (BLAS-like absolute maximum value index)
        - asumv (BLAS-like absolute sum)
    - Deprecated absumm, as it did not correspond to any actual norm.
      (However, an inlined version now exists in the testsuite module for
      randm.)

commit c0140cb752f27e99742f85d23be2181c00a1335e
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Mar 19 11:21:16 2014 -0500

    Fixed packm variants 3 and 4 where every thread was trying to manipulate the same state
    
    Now just performed by the master thread.

commit fb42983bd9943711baa7d1c6496de1215bb816ef
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Tue Mar 18 16:37:28 2014 -0500

    Fixed a barrier bug and a thread decorator bug

commit aa2405f8b23d0f8d2ec04790882f2176ef2e8fd8
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Tue Mar 18 15:23:09 2014 -0500

    Fixing function pointer issues with thread decorator

commit ec8b88f93533942d3711191873310e7ff281bda6
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Tue Mar 18 14:35:37 2014 -0500

    Enabled threading for packm blocked variants 3 and 4

commit 0ac534cdf657bbf04601abfe719ba2887aab5da7
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Tue Mar 18 13:26:27 2014 -0500

    Added decorator for calling parallelized internal functions
    
    Will allow for easy support for different threading models

commit 5296f58975f7d351f88909cc80b6d0cffd73def7
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Mon Mar 17 17:15:35 2014 -0500

    Fixing some bugs with herk parallelization

commit c51d0110831eb89361b4720bf7ed75edbd26ebce
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Mon Mar 17 15:00:47 2014 -0500

    Initial multithreading support for HERK

commit c720b141568d1f289146bf34ded08001f2c0dfbb
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Mon Mar 17 11:39:32 2014 -0500

    Switched to using environment variables to control threading.
    
    The environment variables all follow the format BLIS_X_NT,
    where X is the index of the loop as described in our paper
    Anatomy of High Performance Many-Threaded Matrix Multiplication.
    These indices are IR, JR, IC, KC, and JC.
    
    Also enabled parallelism for hemm and symm, but these are currently untested.
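    
    A minimal sketch of reading the BLIS_X_NT variables described above
    (BLIS_IR_NT, BLIS_JR_NT, BLIS_IC_NT, BLIS_KC_NT, BLIS_JC_NT). The
    helper names are hypothetical, and falling back to 1 thread when a
    variable is unset or invalid is an assumption made for this sketch,
    not documented behavior of this commit.

```c
#include <stdlib.h>

/* Parses a thread count from a string. Returns 1 if the string is NULL,
   not a whole number, or less than 1 (assumed fallback). */
long parse_nt(const char *s)
{
    if (s == NULL)
        return 1;

    char *end;
    long  v = strtol(s, &end, 10);

    if (end == s || *end != '\0' || v < 1)
        return 1;

    return v;
}

/* Reads one loop's thread count, e.g. read_nt_env("BLIS_JC_NT"). */
long read_nt_env(const char *name)
{
    return parse_nt(getenv(name));
}
```

    For example, running with BLIS_JC_NT=2 and BLIS_IC_NT=4 set in the
    environment would request 2 ways of parallelism in the JC loop and 4
    in the IC loop.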

commit 92233cf64274b27b2217c5cfffe75443ff6137a4
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Tue Mar 11 14:16:08 2014 -0500

    Some fixes to gemm thread info tree creation,
    Changed microkernel tests to use the new BLIS_PACKM_SINGLE_THREADED
    instead of BLIS_SINGLE_THREADED

commit 020f80c30289d8bcaa688bf600b01fae9b23b54f
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Tue Mar 11 12:08:17 2014 -0500

    Added files specific to threading for gemm and packm operations

commit 8d8f4352a41926bc923e47be836365b6b726aff2
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Mon Mar 10 15:47:28 2014 -0500

    Added single threaded thread info data structures specifically for gemm and packm

commit 0e8677761175189583ca7d855e24b2bbdd2dada8
Merge: 2e727a02 b3bff631
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Mon Mar 10 15:16:21 2014 -0500

    Merge branch 'master' of https://github.com/tlrmchlsmth/blis

commit 2e727a025a8f796d2b6bd14f489d0ee72e7d1fc7
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Mon Mar 10 15:14:33 2014 -0500

    Modifying the thread info data structures
    
    This change makes each operation have its own thread info type,
    allowing more fine control of threading in operations that have different types of suboperations

commit a770590cf21a459f04bf941c58ee2afd272cc441
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Mar 3 14:31:44 2014 -0600

    Minor fixes to sumsqv, abmaxv.
    
    Details:
    - Minor update to bli_sumsqv_unb_var1() to bring it up-to-date with
      LAPACK 3.5.0's zlassq.f, which, starting with 3.4.2, returns NaN when
      the vector (or matrix) contains a NaN.
    - Minor change to bli_abmaxv_unb_var1() to more closely mimic the
      behavior of netlib BLAS's izamax(). There, a "less than or equal to"
      operator is used in the search instead of "less than", which would
      change the element index returned if there were multiple maximum values.
    - Added macro function definitions for bli_isinf() and bli_isnan(), which
      are currently implemented in terms of isinf() and isnan() from math.h.

commit b3bff631eadf98b15cb422fb4a8e2f855c23e8a7
Merge: 2c158fb8 e8757b03
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Feb 27 16:53:24 2014 -0600

    Merge https://github.com/flame/blis

commit 2c158fb885c27f7b599dc1e85b57edd684f19223
Merge: e4738c48 c2b2ab62
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Feb 27 16:46:23 2014 -0600

    Merge https://github.com/flame/blis
    
    Conflicts:
            frame/1m/packm/bli_packm_blk_var1.c

commit e8757b03a74f9891632242e9a90efb32150826f5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 27 16:40:07 2014 -0600

    Use "%ld" as int format specifier in fprintm.
    
    Details:
    - Changed "%d" to "%ld" when printing integers via bli_fprintm().
    - Meant to include this in previous commit.

commit c663ce3b5170fee7dfb5b528b650d70c8e932cac
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 27 16:32:57 2014 -0600

    Fixed various bugs when C99 complex is enabled.
    
    Details:
    - Fixed various bugs in packm_*_cxk(), the 4m/3m micro-kernels, and
      elsewhere in the framework that were not yet set up to work properly
      when BLIS_ENABLE_C99_COMPLEX is defined in bli_config.h
    - Extensive changes to f2c-derived files in frame/compat/f2c to allow
      C99 complex storage. Most of these changes center around accessing
      real and imaginary components via bli_?real()/bli_?imag() accessor
      macros, and setting of values via bli_?sets() assignment macros.
      (Thanks to Vladimir Sukarev for pointing out that _ENABLE_C99_COMPLEX
      was broken.)

commit e4738c48e00b89391d9baa1fd0aa62d1ea2f95e6
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Feb 27 16:29:46 2014 -0600

    Added support for parallelism in gemm micro-kernel

commit bfe214b633765ed40b57b330fbb84c332663aa40
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Feb 27 15:53:10 2014 -0600

    Fixed bug with parallel packing, and bug with allocating an array of thread infos
    
    In packm variant 1, the variable p_begin was incremented each iteration, causing a dependency.
    This dependency was removed, allowing each iteration to be executed in parallel.
    
    Somewhere in bli_threading.c, I was allocating an array of pointers instead of an array of structs.
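
    A minimal sketch of both fixes described above, not the actual BLIS code.
    The names pack_panels_serial(), pack_panels_independent(),
    alloc_thread_infos(), and thread_info_t are hypothetical, and writing one
    element per panel merely stands in for the real packing work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-thread info record. */
typedef struct { int id; int n_threads; } thread_info_t;

/* Before: the panel pointer is bumped at the end of every iteration, so
   iteration i cannot begin until iteration i-1 has finished. */
void pack_panels_serial(double *p, size_t n_panels, size_t ps, double val)
{
    double *p_begin = p;
    for (size_t i = 0; i < n_panels; ++i) {
        p_begin[0] = val + (double)i;   /* stand-in for packing one panel */
        p_begin += ps;                  /* loop-carried dependency */
    }
}

/* After: each iteration derives its own address from the loop index, so
   the iterations no longer depend on one another. */
void pack_panels_independent(double *p, size_t n_panels, size_t ps, double val)
{
    for (size_t i = 0; i < n_panels; ++i) {
        double *p_begin = p + i * ps;
        p_begin[0] = val + (double)i;
    }
}

/* Allocate an array of structs. Using sizeof(thread_info_t *) here would
   reserve room for an array of pointers instead, which is the kind of
   mistake the commit message mentions. */
thread_info_t *alloc_thread_infos(size_t n)
{
    assert(n > 0);
    thread_info_t *infos = malloc(n * sizeof *infos);
    if (infos != NULL) {
        for (size_t i = 0; i < n; ++i) {
            infos[i].id = (int)i;
            infos[i].n_threads = (int)n;
        }
    }
    return infos;
}
```

    Both loops write the same values; only the second can safely be split
    across threads.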

commit 6193d9ceea552e67170dba45abde04c64271c705
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Feb 27 14:09:19 2014 -0600

    Fixed bug in thread trees

commit ac5a2de1d17ffd460b00fee9757898525a09abae
Merge: 01b125e8 bd3c7ecf
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Feb 27 11:59:33 2014 -0600

    Merge branch 'master' of https://github.com/tlrmchlsmth/blis

commit 01b125e815f19410e8e0611d088b84570e499e93
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Feb 27 11:55:45 2014 -0600

    First pass at adding parallelism to BLIS.
    
    Added a multithreading infrastructure that should be independent of multithreading implementation in the future.
    Currently, gemm blocked variants 1f and 2f and packm blocked variant 1 are parallelized.

commit c2b2ab62707e4174892aff3ce65f36f54878fae5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 26 12:46:45 2014 -0600

    Deprecated panel stride alignment in bli_config.h.
    
    Details:
    - Removed BLIS_CONTIG_STRIDE_ALIGN_SIZE from bli_config.h of all
      configurations. It was already going unused in packm_init() since the
      recent 4m/3m commit. This setting was rarely, if ever, useful, and its
      existence only posed a potential risk for 4m/3m-based implementations.
    - Removed BLIS_CONTIG_STRIDE_ALIGN_SIZE usage from mem_pool_macro_defs.h.
    - Updated comments regarding CONTIG_STRIDE_ALIGN_SIZE in template
      micro-kernels.

commit f18aee83a5ac1b14808686fc3c5a3c846a1d99b9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 25 17:58:42 2014 -0600

    CHANGELOG update (for 0.1.1).

commit fde5f1fdece19881f50b142e8611b772a647e6d2 (tag: 0.1.1)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 25 13:34:56 2014 -0600

    Added extensive support for configuration defaults.
    
    Details:
    - Standard names for reference kernels (levels-1v, -1f and 3) are now
      macro constants. Examples:
        BLIS_SAXPYV_KERNEL_REF
        BLIS_DDOTXF_KERNEL_REF
        BLIS_ZGEMM_UKERNEL_REF
    - Developers no longer have to name all datatype instances of a kernel
      with a common base name; [sdcz] datatype flavors of each kernel or
      micro-kernel (level-1v, -1f, or 3) may now be named independently.
      This means you can now, if you wish, encode the datatype-specific
      register blocksizes in the name of the micro-kernel functions.
    - Any datatype instance of any kernel (1v, 1f, or 3) that is left
      undefined in bli_kernel.h will default to the corresponding reference
      implementation. For example, if BLIS_DGEMM_UKERNEL is left undefined,
      it will be defined to be BLIS_DGEMM_UKERNEL_REF.
    - Developers no longer need to name level-1v/-1f kernels with multiple
      datatype chars to match the number of types the kernel WOULD take in
      a mixed type environment, as in bli_dddaxpyv_opt(). Now, one char is
      sufficient, as in bli_daxpyv_opt().
    - There is no longer a need to define an obj_t wrapper to go along with
      your level-1v/-1f kernels. The framework now provides a _kernel()
      function which serves as the obj_t wrapper for whatever kernels are
      specified (or defaulted to) via bli_kernel.h.
    - Developers no longer need to prototype their kernels, and thus no
      longer need to include any prototyping headers from within
      bli_kernel.h. The framework now generates kernel prototypes, with the
      proper type signature, based on the kernel names defined (or defaulted
      to) via bli_kernel.h.
    - If the complex datatype x (of [cz]) implementation of the gemm micro-
      kernel is left undefined by bli_kernel.h, but its same-precision real
      domain equivalent IS defined, BLIS will use a 4m-based implementation
      for the datatype x implementations of all level-3 operations, using
      only the real gemm micro-kernel.
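
    The defaulting rule above can be sketched with plain preprocessor
    conditionals. This is an illustration under stated assumptions, not the
    framework's actual headers: the my_* functions and their int(int)
    signature are hypothetical, and BLIS_SGEMM_UKERNEL and
    BLIS_SGEMM_UKERNEL_REF are assumed by analogy with the
    BLIS_DGEMM_UKERNEL and BLIS_DGEMM_UKERNEL_REF names mentioned above.

```c
#include <assert.h>

/* Hypothetical kernels; real micro-kernels have a different signature. */
int my_dgemm_ukr_ref(int k)     { return k; }
int my_sgemm_ukr_ref(int k)     { return k; }
int my_sgemm_ukr_opt_8x4(int k) { return 2 * k; }

/* Standard names for the reference kernels. */
#define BLIS_SGEMM_UKERNEL_REF my_sgemm_ukr_ref
#define BLIS_DGEMM_UKERNEL_REF my_dgemm_ukr_ref

/* What a configuration's bli_kernel.h might contain: only the single
   precision kernel is overridden, and its name encodes its blocksizes. */
#define BLIS_SGEMM_UKERNEL my_sgemm_ukr_opt_8x4

/* Defaulting: anything left undefined falls back to the reference. */
#ifndef BLIS_SGEMM_UKERNEL
#define BLIS_SGEMM_UKERNEL BLIS_SGEMM_UKERNEL_REF
#endif
#ifndef BLIS_DGEMM_UKERNEL
#define BLIS_DGEMM_UKERNEL BLIS_DGEMM_UKERNEL_REF
#endif

int call_sgemm_ukernel(int k) { return BLIS_SGEMM_UKERNEL(k); }
int call_dgemm_ukernel(int k) { return BLIS_DGEMM_UKERNEL(k); }

/* Usage: double falls back to the reference, single uses the override. */
void check_defaults(void)
{
    assert(call_dgemm_ukernel(3) == 3);
    assert(call_sgemm_ukernel(3) == 6);
}
```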

commit 15b51e990f1d21333b5f7af97c211756247336e5
Merge: 6363a9f6 fc04b5eb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 21 09:04:32 2014 -0600

    Merge branch 'master' of github.com:fgvanzee/blis

commit fc04b5eb69868c341ce03f5ef1f02de4b8c121b0
Merge: b29e1c2b d1813c9d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 21 09:04:13 2014 -0600

    Merge pull request #3 from figual/master
    
    New ARM armv7a kernels and Assembly file consideration in Makefile

commit d1813c9dee34410833db5061e6588ec1a6c9ecd4
Author: Francisco Igual <figual@pandaboard.(none)>
Date:   Fri Feb 21 15:14:31 2014 +0100

    Added new armv7a micro-kernels and configuration files from Werner Saar.

commit 0cd098c03a000ed9426a7e9135190696da8cadbc
Author: Francisco Igual <figual@pandaboard.(none)>
Date:   Fri Feb 21 15:12:30 2014 +0100

     o Modified Makefile to consider .S assembly microkernels.

commit 6363a9f658257fe3d814a3dce5308f807adb54a2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 19 17:00:52 2014 -0600

    Added level-3 support for complex via 4m-/3m.
    
    Details:
    - Added the ability to induce complex domain level-3 operations via new
      virtual complex micro-kernels which are implemented via only real
      domain micro-kernels. Two new implementations are provided: 4m and 3m.
      4m implements complex matrix multiplication in terms of four real
      matrix multiplications, whereas 3m uses only three and thus is
      capable of even higher (than peak) performance. However, the 3m method
      has somewhat weaker numerical properties, making it less desirable
      in general.
    - Further refined packing routines, which were recently revamped, and
      added packing functionality for 4m and 3m.
    - Some modifications to trmm and trsm macro-kernels to facilitate indexing
      into micro-panels which were packed for 4m/3m virtual kernels.
    - Added 4m and 3m interfaces for each level-3 operation.
    - Various other minor changes to facilitate 4m/3m methods.
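
    The difference between 4m and 3m is easiest to see on a single pair of
    complex scalars; the matrix methods apply the same identities with real
    matrix products in place of the scalar products. The sketch below is only
    that scalar analogy, and cplx_t, cmul_4m(), cmul_3m(), and
    check_3m_matches_4m() are hypothetical names.

```c
#include <assert.h>

typedef struct { double real; double imag; } cplx_t;

/* 4m: four real multiplications. */
cplx_t cmul_4m(cplx_t a, cplx_t b)
{
    cplx_t c;
    c.real = a.real * b.real - a.imag * b.imag;
    c.imag = a.real * b.imag + a.imag * b.real;
    return c;
}

/* 3m: three real multiplications, at the cost of extra additions.
   The imaginary part is recovered by subtraction, which is where the
   weaker numerical properties come from. */
cplx_t cmul_3m(cplx_t a, cplx_t b)
{
    double t1 = a.real * b.real;
    double t2 = a.imag * b.imag;
    double t3 = (a.real + a.imag) * (b.real + b.imag);
    cplx_t c;
    c.real = t1 - t2;
    c.imag = t3 - t1 - t2;
    return c;
}

/* Sanity check. Exact equality is only expected when every intermediate
   value is exactly representable (e.g. small integer-valued inputs); in
   general the two methods may differ by rounding error. */
void check_3m_matches_4m(cplx_t a, cplx_t b)
{
    cplx_t c4 = cmul_4m(a, b);
    cplx_t c3 = cmul_3m(a, b);
    assert(c4.real == c3.real && c4.imag == c3.imag);
}
```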

commit b29e1c2b278c177e104c84ba462820ee8296df6c
Merge: ee60377e bd3c7ecf
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 14 14:11:54 2014 -0600

    Merge pull request #2 from tlrmchlsmth/master
    
    Fixes and improvements to xeon phi implementation.

commit bd3c7ecfb54a9b9851c7d364f41c21e4cff52f6f
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Feb 14 14:05:57 2014 -0600

    Removing changes to input.general and input.operations

commit ce066863683cb4e910270cf8ab8e138b01ff3358
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Feb 14 13:40:24 2014 -0600

    Fixed more Xeon Phi bugs, especially with scattered update

commit 31134b5c7076423aee1b4f494e925f27171d97e6
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Feb 14 11:19:44 2014 -0600

    Some fixes, changes, and improvements to the microkernel for the Xeon Phi

commit ee60377e467862b9d8a7205c45dce5cf66c78c46
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 13 14:03:31 2014 -0600

    Shifted some fields in info_t.
    
    Details:
    - Shifted the pack order, pack buffer type, and structure type fields
      to make room for an extra bit in the pack type/status field.

commit bd3ab1ad4cf42f8bc30ab262acf8eccb49bb1a08
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 13 09:29:55 2014 -0600

    Minor fixes to trsm consistent with prev on trmm.
    
    Details:
    - Removed use of bli_min() and bli_max() that were only being used to
      try to support situations where the diagonal would intersect the
      short end of some micro-panels, which is a situation that is disallowed
      at a higher level by various constraints on the register and cache
      blocksize. This only affected trsm_ll and trsm_lu.
    - Use panel stride as passed into the macro-kernel rather than compute
      it via k and PACKMR/PACKNR. This affects all macro-kernels of trsm.

commit 6260b0b5f8bd248f3f66e5a1c6854bdbd9d02ad0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 13 09:19:56 2014 -0600

    Fixed obscure bug in trmm_ll, trmm_lu.
    
    Details:
    - Fixed an obscure bug in left-hand trmm that would only manifest when
      non-zero register blocksize extensions (PACKMR > MR or PACKNR > NR)
      are used.
    - Removed use of bli_min() and bli_max() that were only being used to
      try to support situations where the diagonal would intersect the
      short end of some micro-panels, which is a situation that is disallowed
      at a higher level by various constraints on the register and cache
      blocksize. This only affected trmm_ll and trmm_lu.
    - Use panel stride as passed into the macro-kernel rather than compute
      it via k and PACKMR/PACKNR. This affects all macro-kernels of trmm.

commit 16915c1c1e55c660bf82141cdadf7c0860d5b464
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 11 10:54:19 2014 -0600

    Fixed an obscure bug in packm_cxk().
    
    Details:
    - Fixed a bug in packm_cxk() whereby the packm ukernel was being chosen
      from ldp, which is always equal to PACKMR or PACKNR. The problem with
      this is that the pack ukernels were implicitly assuming that the
      panel dimension of the panel being packed was equal to ldp, which
      is not the case when the register blocksize extensions are non-zero
      (ie: when PACKMR > MR or PACKNR > NR, whichever is applicable). This
      problem has been fixed by passing ldp into the pack ukernels, which
      now walk through the packed micro-panel region by incrementing by this
      value, rather than incrementing by the inherent panel dimension value
      assumed by each packm ukernel (e.g. 4 in the case of packm_ref_4xk).
    - Also fixed a very minor edge case inefficiency whereby pack ukernels
      smaller than the default were not being used in edge cases, and instead
      those situations were being handled by scal2m. This is related to the
      issue above, because the pack ukernel itself was being chosen based on
      ldp instead of the panel dimension.
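
    A minimal sketch of the corrected behavior, assuming column-major storage
    and a hypothetical pack_mxk() routine (the real pack ukernels have a
    different interface). The panel dimension m and the packed leading
    dimension ldp are separate arguments, and the destination advances by ldp
    per column rather than by m.

```c
#include <assert.h>
#include <stddef.h>

/* Copy an m-by-k column-major panel of a (leading dimension lda) into p.
   Column j of the packed panel starts at p + j*ldp. When ldp > m (i.e. a
   non-zero register blocksize extension), the extra ldp - m elements of
   each packed column are left untouched by this sketch. */
void pack_mxk(size_t m, size_t k,
              const double *a, size_t lda,
              double *p, size_t ldp)
{
    assert(m <= ldp);
    for (size_t j = 0; j < k; ++j) {
        for (size_t i = 0; i < m; ++i) {
            p[j * ldp + i] = a[j * lda + i];
        }
    }
}
```

    Stepping by m instead of ldp would place the second column at offset m,
    which is wrong whenever ldp > m.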

commit b7da57b282c5a5e2208946e60309d2352f55351d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 11 10:28:23 2014 -0600

    Updated calls to packm_blk_var2() in testsuite.
    
    Details:
    - In ukernel testsuite modules, replaced calls to packm_blk_var2() with
      _var1(). Meant to include this in previous commit.

commit c255a293e25b2223c88e8800267cd06ad2a90041
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 10 14:31:24 2014 -0600

    Consolidated packm_blk_var2 and var3.
    
    Details:
    - Consolidated the functionality previously supported by packm_blk_var2()
      and packm_blk_var3() into a new variant, packm_blk_var1().
    - Updates to packm_gen_cxk(), packm_herm_cxk(), and packm_tri_cxk()
      to accommodate above changes.
    - Removed packm_blk_var3() and retired packm_blk_var2() to
      frame/1m/packm/old.
    - Updated all level-3 _cntl_init() functions so that the new, more
      versatile packm_blk_var1 is used for all level-3 matrix packing.

commit 32d8f264ae7b28155f5d7b21dcc5ecb78da2e0ab
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Feb 9 10:07:37 2014 -0600

    Refactored packm variants.
    
    Details:
    - Revised packm_blk_var2() and _var3() by encapsulating the general,
      hermitian/symmetric, and triangular panel-packing subproblems into
      separate functions: packm_gen_cxk(), packm_herm_cxk(), and
      packm_tri_cxk(), respectively. Also, homogenized the packm code as
      well as the new specialized packm_*_cxk() code to further improve
      readability.

commit 6c8067028707947fcdf4f856a272e15bb9ed91e3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 7 11:27:15 2014 -0600

    Renamed enumerated type in testsuite and modules.
    
    Details:
    - Renamed the test suite's "mt_impl_t" enumerated type to "iface_t", and
      renamed all corresponding "impl" variables to "iface".

commit 6c12598b1bc567f0b08f58aebdc753a1c1390378
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 6 18:26:35 2014 -0600

    Employ simpler INSERT_ macro for ref ukernels.
    
    Details:
    - Defined a new macro, INSERT_GENTFUNC_BASIC0, which takes only one
      argument--the base name of the function--and employed this macro
      in the reference micro-kernel files instead of the _BASIC macro,
      which takes one auxiliary argument. That argument was not being
      used and probably just acted to unnecessarily obfuscate.

commit 32cae66326b68706d0e695cfd60c9ca5bc32c534
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 6 18:06:42 2014 -0600

    Fixed some instances of sloppy 'restrict' usage.
    
    Details:
    - Fixed some technical incorrectness with some usage of the 'restrict'
      keyword in the reference trsm micro-kernels.
    - Tweak to testsuite/Makefile that causes rebuild if libblis was
      touched.

commit 7aceef7683e2a2aff3c7ec2a73508036af2e19e2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 6 17:31:19 2014 -0600

    Updated comments in macro-kernels.
    
    Details:
    - Updated (and fixed some errors in) the "Assumptions/assertions" comment
      section of macro-kernels.
    - Changed register blocksizes of reference configuration to MR = 8 and
      NR = 4. It's always good for MR != NR in the reference configuration
      since it may help uncover bugs related to non-square micro-kernels.

commit 8fd292aa78950bcdf556605718f09d13f9575abc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 6 14:32:21 2014 -0600

    Pass panel dimensions into macro-kernels.
    
    Details:
    - Modified the interfaces to the datatype-specific macro-kernels so that:
      - pd_a and pd_b are passed in (which contain the panel dimensions of
        packed panels of a and b).
      - rs_a and cs_b are no longer passed in (they were guaranteed to be 1).
    - Modified implementations of datatype-specific macro-kernels so pd_a,
      pd_b, cs_a, and rs_b are used instead of cpp macros for MR, NR, PACKMR,
      and PACKNR, respectively.
    - Declare temporary c matrices (ct) as being maxmr-by-maxnr, which for now
      is equivalent to being mr-by-nr. maxmr and maxnr are declared in a new
      header file bli_kernel_post_macro_defs.h.

commit 3404e6657eabb017cd1580a2f1dd8e6fb13df923
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 5 11:19:10 2014 -0600

    Deprecated incremental blocksize macro const defs.
    
    Details:
    - Removed macro constant definitions related to incremental blocksizes
      from all configurations' bli_kernel.h files. This change is minor and
      is mostly a cleanup related to a previous commit.

commit 1e9afd39a63e0a58167d4439c1a0a880a4a35657
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 4 20:15:19 2014 -0600

    Comment updates (removed vestiges of "bd").

commit 5cf58f7c2d5bc0d2d94d9576f7158d8f133b7aac
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 4 09:15:19 2014 -0600

    Added early returns for "object is zeros" case.
    
    Details:
    - Added some logic to packm_init(), pack_int() and gemm_int() so that
      (a) objects marked as BLIS_ZEROS are not packed, and (b) those
      objects are not computed with. This functionality is not currently
      needed by any existing implementations, but may be used in the
      future.

commit 6bbd4be769a9b344a55abe5ddaca1a99fd29f7b4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 3 13:15:25 2014 -0600

    Added 'f' on some gemm and trmm blocked variants.
    
    Details:
    - Added 'f' to some block variant files/functions to be consistent with
      other file/functions' naming convention. Here, the f indicates
      partitioning in the "forward" direction.

commit eb13cb2c6b182df5e2a9b88c76f50e2cee25b9e0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 3 11:07:01 2014 -0600

    Removed redundant non-gemm blksz_t creation.
    
    Details:
    - Removed code that creates duplicate blksz_t objects for herk, trmm,
      and trsm. Instead, the gemm blksz_t objects are accessed via extern
      and used directly. This reduces the amount of code associated with
      each of the three _cntl_init() and _cntl_finalize() functions.

commit 0a023a7d9e58e53b8c204a5f49aa8ca9afeba938
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jan 29 14:02:08 2014 -0600

    Introduced new level-3 front-end layer.
    
    Details:
    - Added new _front() functions for each level-3 operation. This is done
      so that the choosing of the control tree (and *only* the choosing of
      the control tree) happens in what was previously the "front end"
      (e.g. bli_gemm()). That control tree is then passed into the _front()
      function, which then performs up-front tasks such as parameter
      checking.

commit 251c5d112196d37b183e554bc9d406104aed65fb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jan 28 19:40:29 2014 -0600

    Removed redundant hemm, her2k control trees.
    
    Details:
    - Removed code that generated a control tree specifically for hemm and
      symm. Instead, the gemm control tree is now configured so that it
      works for gemm, hemm, or symm.
    - Retired most her2k code, as it was not being used. (Currently, her2k is
      implemented as two invocations of herk.) I couldn't think of many
      situations where her2k variants were needed.
    - Removed some older her2k code.

commit 5a36e5bf2f59d1e85d6dbce32a07d604c5e82d11
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jan 27 11:13:00 2014 -0600

    Embed func_t microkernel objects in control trees.
    
    Details:
    - Modified all control tree node definitions to include a new field of
      type func_t*, which is similar to a blksz_t except that it contains
      one function pointer (each typed simply as void*) for each datatype.
      We use the func_t* to embed pointers to the micro-kernels to use for
      the leaf-level nodes of each control tree. This change is a natural
      extension of control trees and will allow more flexibility in the
      future.
    - Modified all macro-kernel wrappers to obtain the micro-kernel pointers
      from the incoming (previously ignored) control tree node and then pass
      the queried pointer into the datatype-specific macro-kernel code, which
      then casts the pointer to the appropriate type (new typedefs residing
      in bli_kernel_type_defs.h) and then uses the pointer to call the micro-
      kernel. Thus, the micro-kernel function is no longer "hard-coded" (that
      is, determined when the datatype-specific macro-kernel functions are
      instantiated by the C preprocessor).
    - Added macros to bli_kernel_macro_defs.h that build datatype-specific
      base names if they do not exist already, and then use those to build
      datatype-specific micro-kernel function names. This will allow
      developers extra flexibility if they wanted to, for example, name each
      of their datatype-specific micro-kernels differently (e.g. double
      real might be named bli_dgemm_opt_4x4() while double complex might be
      named bli_zgemm_opt_2x2()).
    - Inserted appropriate code into _cntl_init() functions that allocates
      and initializes a func_t object for the corresponding micro-kernels.
      The gemm ukernel func_t object is created once, in bli_gemm_cntl_init(),
      and then reused via extern wherever possible.
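
    The lookup-then-cast flow described above might look roughly like the
    sketch below. It is not the BLIS definition of func_t: the layout, the
    dt_t enum, the toy 1x1 "micro-kernels", and all function names are
    hypothetical. Also, whereas the commit stores each pointer as void*, this
    sketch uses a generic function-pointer type, since ISO C does not
    guarantee that a function pointer survives a round trip through void*.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { DT_FLOAT = 0, DT_DOUBLE = 1, DT_COUNT = 2 } dt_t;

/* One untyped function pointer per datatype. */
typedef void (*generic_fp)(void);
typedef struct { generic_fp ptr[DT_COUNT]; } func_t;

/* Properly typed micro-kernel signatures used when casting back. */
typedef void (*sgemm_ukr_t)(int k, const float *a, const float *b, float *c);
typedef void (*dgemm_ukr_t)(int k, const double *a, const double *b, double *c);

/* Toy 1x1 micro-kernels: c += sum_i a[i] * b[i]. */
void my_sgemm_ukr(int k, const float *a, const float *b, float *c)
{
    for (int i = 0; i < k; ++i) *c += a[i] * b[i];
}

void my_dgemm_ukr(int k, const double *a, const double *b, double *c)
{
    for (int i = 0; i < k; ++i) *c += a[i] * b[i];
}

/* Done once, at control tree initialization time in this sketch. */
void func_init(func_t *f)
{
    f->ptr[DT_FLOAT]  = (generic_fp)my_sgemm_ukr;
    f->ptr[DT_DOUBLE] = (generic_fp)my_dgemm_ukr;
}

/* Stand-in for a datatype-specific macro-kernel: it queries the pointer,
   casts it to the right type, and calls it, so the micro-kernel is chosen
   at run time rather than fixed by the preprocessor. */
void dgemm_macro_kernel(const func_t *f, int k,
                        const double *a, const double *b, double *c)
{
    assert(f != NULL && f->ptr[DT_DOUBLE] != NULL);
    dgemm_ukr_t ukr = (dgemm_ukr_t)f->ptr[DT_DOUBLE];
    ukr(k, a, b, c);
}
```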

commit 6cbd6f1c7f1915180aa28939833afde48665c5ae
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jan 24 10:38:29 2014 -0600

    Removed commented mixed domain macro-kernel code.
    
    Details:
    - Removed commented-out code from macro-kernels that was supposed to
      facilitate implementing mixed domain (complex times real) matrix
      multiplication. This functionality is still (probably) possible,
      but I'm getting tired of looking at the code every time I edit
      a macro-kernel. Plus, there are probably ways of doing it at a
      higher level, via control trees.

commit 29778be1119f1a884330d7f8dc424a2df4101d58
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jan 22 16:03:11 2014 -0600

    Removed b_aux field from cntl nodes.
    
    Details:
    - Removed b_aux field from all control tree node definitions. This field
      was being used in certain optimizations (incremental blocking) that were
      not actually being employed within BLIS, and are probably not employed
      by others.
    - Updated all _cntl_obj_create() function definitions and invocations
      according to above change.
    - Retired bli_gemm_blk_var4.c, which was one such function that employed
      incremental blocking, but which was never called by BLIS itself.

commit 06ac727a42ec9e832c7832745036702014638f99
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jan 15 16:44:52 2014 -0600

    Updated some comments in level-3 front ends.

commit d628bf1da1560f1f5126a1ddfed8714f0a4b8da3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jan 15 11:40:12 2014 -0600

    Consolidated pack_t enums; retired VECTOR value.
    
    Details:
    - Changed the pack_t enumerations so that BLIS_PACKED_VECTOR no longer has
      its own value, and instead simply aliases to BLIS_PACKED_UNSPEC. This
      makes room in the three pack_t bits of the info field of obj_t so that
      two values are now unused, and may be used for other future purposes.
    - Updated sloppy terminology usage in comments in level-2 front-ends.
      (Replaced "is contiguous" with more accurate "has unit stride".)

commit ddc8c1c379b4787be5954802906593d7ea144452
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jan 13 14:55:43 2014 -0600

    Suppress warning in Makefile (UNINSTALL_LIBS).
    
    Details:
    - Redirect errors to /dev/null when using 'find' to locate libraries that
      would be uninstalled upon executing "make uninstall-old". Before, if the
      Makefile was read before $(INSTALL_PREFIX)/lib existed, a "No such file
      or directory" message was emitted. This message was harmless, but is now
      suppressed in this situation.

commit f8f67d7251bffc05020e20527c100c8115fd5e55
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jan 10 09:06:11 2014 -0600

    Typecast bli_getopt() return value in testsuite.
    
    Details:
    - In the test suite driver, inserted an explicit typecast of the return
      value of bli_getopt() prior to parsing. The lack of typecast caused a
      problem on at least one system whereby a return value of -1 was
      interpreted as a garbage character. Thanks to Francisco Igual for finding
      and submitting this fix.

commit e7f154fe2ed3e10e2323cefe5d25c2c23ac902c4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jan 10 08:48:07 2014 -0600

    Applied edge case fix to arm/neon microkernel.
    
    Details:
    - Applied an edge case bugfix, courtesy of Francisco Igual, to the current
      double precision real gemm microkernel in kernels/arm/neon/3.

commit 89c76a8a51d070d263c13bfa5ace65769509f2b4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jan 9 12:08:37 2014 -0600

    Allow building outside source distribution.
    
    Details:
    - Modified build system (mostly configure and top-level Makefile) so that
      a user can build a BLIS library outside of the top-level directory of
      the source distribution.
    - Added "test" target to Makefile so that the user can run "make test",
      which will compile, link, and run the testsuite binary. This works even
      if the build directory is externally located, thanks to the test suite
      binary's new -g and -o command-line options. Also, when creating the
      test suite via the top-level Makefile, the linking is against the
      local archive, in lib/<configname>, rather than at <install_prefix>/lib.
    - Modified testsuite/Makefile so that it links against the library built
      locally, in ../lib/<configname>.
    - Added "-lm" to LDFLAGS of most configurations' make_defs.mk.
    - Various other cleanups to build system.

commit 12fa82ec12cc340ab28552997d9d50f7c98691f8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jan 8 16:09:26 2014 -0600

    Implemented bli_getopt().
    
    Details:
    - Added bli_getopt.c and .h files to frame/base. These files implement
      a custom version of getopt(), which may be used to parse command line
      options passed into a program via argc/argv. I am implementing this
      function myself, as opposed to using the version available via unistd.h,
      for portability reasons, as the only requirement is string.h (which
      is available via the standard C library).
    - Modified test suite to allow the user to specify the file name (and/or
      path) to the parameters and operations input files: -g may be used to
      specify the general input file and -o to specify the operations input
      file. If -g or -o or both are not given, default filenames are assumed
      (as well as their existence in the current directory).

commit cafb58e86ea5cfb21b9eedc57ca8ebbf24252098
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jan 6 13:28:36 2014 -0600

    Updated template micro-kernels to use auxinfo_t.
    
    Details:
    - Updated template micro-kernel implementations (located in
      config/template/kernels), to adhere to the new auxinfo_t interface.
      Meant to include this change in a0331fb1.
    - Changed template configuration to use 64-bit integers (for both BLIS
      and the BLAS compatibility layer).

commit 9ab126b499c3805045020cb89a8a5848e28d3bf5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jan 6 12:13:26 2014 -0600

    Removed error checks in netlib->BLIS param mapping
    
    Details:
    - Disabled error checking in netlib-to-BLIS parameter mapping functions.
      If the char value input to these functions was not one of the defined
      values, bli_check_error_code() with the appropriate error code value
      would be called, resulting in an abort(). This was unnecessary and
      redundant since these routines are currently only used within the
      BLAS compatibility layer, and they are only called AFTER parameter
      checking has already been performed on the original BLAS char values.
      If the application tried to override xerbla() to prevent an abort()
      from being called, this error checking would still get in the way.
      Thus, instead of reporting the error situation to the framework (ie:
      calling abort()), an arbitrary BLIS parameter value is now chosen and
      the function returns normally. Thanks to Jeff Hammond for finding and
      reporting this issue.

commit 2cb13600f9f9601c60e7f96f4ca159d169ade9cb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jan 3 12:29:13 2014 -0600

    Updated year in copyright headers to 2014.

commit 290fa54e0083c9c837188b8321b13b1b282e7b0c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Dec 20 14:10:26 2013 -0600

    Store variable panel strides in trmm/trsm auxinfo.
    
    Details:
    - Changed the value being stored into the auxinfo_t structure in trmm
      and trsm macro-kernels. Whereas before we stored whatever value was
      provided to the macro-kernel implementation via ps_a/ps_b, now we
      store the stride that will advance to the next variable-length
      micro-panel of the triangular matrix A (left) or B (right).
    - Whitespace changes to the files affected above.

commit e3a6c7e77667fd749248df3f75f880266c3136ec
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 19 16:29:31 2013 -0600

    Macroized conditionals for a2/b2 in macro-kernels.
    
    Details:
    - Replaced conditional expressions in macro-kernels related to computing
      the addresses a2 and b2 (a_next and b_next) with a preprocessor macro
      invocation, bli_is_last_iter(), that tests the same condition.
    - Updated gemm_ukr module to use auxinfo_t argument.
    - Whitespace changes in test suite ukr modules.

commit a0331fb10a50393e31d16339053b75b944132da1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 19 14:50:11 2013 -0600

    Introduced auxinfo_t argument to micro-kernels.
    
    Details:
    - Removed a_next and b_next arguments to micro-kernels and replaced them
      with a pointer to a new datatype, auxinfo_t, which is simply a struct
      that holds a_next and b_next. The struct may hold other auxiliary
      information that may be useful to a micro-kernel, such as micro-panel
      stride. Micro-kernels may access struct fields via accessor macros
      defined in bli_auxinfo_macro_defs.h.
    - Updated all instances of micro-kernel definitions, micro-kernel calls,
      as well as macro-kernels (for declaring and initializing the structs)
      according to above change.
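
    A minimal sketch of the idea, assuming a two-field struct and made-up
    accessor names; the actual macros live in bli_auxinfo_macro_defs.h and
    are not reproduced here. The toy micro-kernel and macro-kernel below are
    hypothetical, and the choice to wrap b_next around to the first panel on
    the last iteration is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Auxiliary information handed to a micro-kernel. */
typedef struct {
    const void *a_next;   /* address of the next micro-panel of A */
    const void *b_next;   /* address of the next micro-panel of B */
} auxinfo_t;

/* Hypothetical accessor macros. */
#define auxinfo_set_next_a(aux, p) ((aux)->a_next = (p))
#define auxinfo_set_next_b(aux, p) ((aux)->b_next = (p))
#define auxinfo_next_a(aux)        ((aux)->a_next)
#define auxinfo_next_b(aux)        ((aux)->b_next)

/* Toy 1x1 micro-kernel: one struct pointer replaces the former a_next
   and b_next arguments. */
void toy_ukr(int k, const double *a, const double *b, double *c,
             const auxinfo_t *aux)
{
    assert(aux != NULL);
    const double *a_next = auxinfo_next_a(aux);
    const double *b_next = auxinfo_next_b(aux);
    (void)a_next;   /* a real kernel might, for example, prefetch these */
    (void)b_next;
    for (int i = 0; i < k; ++i) *c += a[i] * b[i];
}

/* Toy macro-kernel: declares and initializes the struct per iteration. */
void toy_macro_kernel(int n_iter, int k,
                      const double *a, const double *b, double *c)
{
    for (int j = 0; j < n_iter; ++j) {
        const double *b1 = b + j * k;
        const double *b2 = (j == n_iter - 1) ? b : b1 + k;
        auxinfo_t aux;
        auxinfo_set_next_a(&aux, a);
        auxinfo_set_next_b(&aux, b2);
        toy_ukr(k, a, b1, &c[j], &aux);
    }
}
```

    Adding a further field to the struct later (such as a micro-panel stride)
    would not change the micro-kernel signature, which is the point of the
    indirection.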

commit 392428dea4001fe4384efe29f6cde32f8abeeb35
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 12 19:01:47 2013 -0600

    Added "ri" scalar macros.
    
    Details:
    - Added set of basic scalar macros that take arguments' real and
      imaginary components separately, named like the previous set except
      with the "ris" (instead of "s") suffix.
    - Redefined the previous set of scalar macros (those that take arguments
      "whole") in terms of the new "ri" set.
    - Renamed setris and getris macros to sets and gets.
    - Renamed setimag0 macros to seti0s.
    - Use bli_?1 macro instead of a local constant in bla_trmv.c, bla_trsv.c.
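
    The layering of "whole" scalar macros on top of "ri" macros can be
    sketched as follows. The names axpyris and axpys are hypothetical and
    only follow the "ris"/"s" suffix convention described above; dcomplex_t
    is likewise a stand-in type.

```c
#include <assert.h>

typedef struct { double real; double imag; } dcomplex_t;

/* "ri" flavor: real and imaginary components are passed separately.
   Computes y := y + a * x. Temporaries are used so that the update of
   the real part does not disturb the computation of the imaginary part. */
#define axpyris(ar, ai, xr, xi, yr, yi) \
    do { \
        double yr_new_ = (yr) + (ar) * (xr) - (ai) * (xi); \
        double yi_new_ = (yi) + (ar) * (xi) + (ai) * (xr); \
        (yr) = yr_new_; \
        (yi) = yi_new_; \
    } while (0)

/* "Whole" flavor, defined in terms of the "ri" flavor. */
#define axpys(a, x, y) \
    axpyris((a).real, (a).imag, (x).real, (x).imag, (y).real, (y).imag)

/* Usage: (1 + 1i) + (1 + 2i) * (3 + 4i) = -4 + 11i. */
dcomplex_t axpy_example(void)
{
    dcomplex_t a = {1.0, 2.0};
    dcomplex_t x = {3.0, 4.0};
    dcomplex_t y = {1.0, 1.0};
    axpys(a, x, y);
    assert(y.real == -4.0 && y.imag == 11.0);
    return y;
}
```

    The "ri" form is what lets code operate on real and imaginary parts that
    are not stored together in one struct.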

commit f60c8adc2f61eaba06b892f4e73000159de93056
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 10 14:39:56 2013 -0600

    Minor updates to dunnington configuration.
    
    Details:
    - Added commented alternatives to dunnington configuration's bli_kernel.h.
    - Minor reformatting of optimization flag variables in make_defs.mk.

commit 4ef20150492db254b5baf2368add62e19b0ac11b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 9 18:53:03 2013 -0600

    Tweaks to dunnington configuration (x86_64/core2).
    
    Details:
    - Updated BLIS_DEFAULT_KC_D from 256 to 384.
    - Enabled cache blocksize extension of up to 25% for MC and KC (for
      double-precision real).

commit 5ad2ce7bf5ba3ea955e6d517bfd270e02820263b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 9 18:30:49 2013 -0600

    Minor x86_64 (core2) kernel fixes.
    
    Details:
    - Fixed copy-and-paste bug whereby [scz]gemmtrsm_u_opt_d4x4 kernels
      for x86_64/core2 were calling the wrong reference code (l instead
      of u).
    - Fixed some unused variables in x86_64/core2 dotaxpyv and dotxaxpyf
      kernels.
    - Minor typecasting fix in testsuite/src/test_libblis.c.
    - Makefile updates.

commit d289f5d3a9c0e1a68a17c1c32b736e282a289c4c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 5 10:56:13 2013 -0600

    Whitespace changes to level-2 blocked variants.
    
    Details:
    - Joined some lines in level-2 blocked variants to match formatting used
      in level-3 blocked variants.
    - Streamlined implementation of bli_obj_equals() in bli_query.c.

commit b444489f100d218bc8ef29b01ff8489c358559f9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 3 16:08:30 2013 -0600

    Added new "attached" scalar representation.
    
    Details:
    - Added infrastructure to support a new scalar representation, whereby
      every object contains an internal scalar that defaults to 1.0. This
      facilitates passing scalars around without having to house them in
      separate objects. These "attached" scalars are stored in the internal
      atom_t field of the obj_t struct, and are always stored to be the same
      datatype as the object to which they are attached. Level-3 variants no
      longer take scalar arguments; however, level-3 internal back-ends still
      do; this is so that the calling function can perform subproblems such
      as C := C - alpha * A * B on-the-fly without needing to change either
      of the scalars attached to A or B.
    - Removed scalar argument from packm_int().
    - Observe and apply attached scalars in scalm_int(), and removed scalar
      from interface of scalm_unb_var1().
    - Renamed the following functions (and corresponding invocations):
    
       bli_obj_init_scalar_copy_of()
                               -> bli_obj_scalar_init_detached_copy_of()
       bli_obj_init_scalar()   -> bli_obj_scalar_init_detached()
       bli_obj_create_scalar_with_attached_buffer()
                               -> bli_obj_create_1x1_with_attached_buffer()
       bli_obj_scalar_equals() -> bli_obj_equals()
    
    - Defined new functions:
    
       bli_obj_scalar_detach()
       bli_obj_scalar_attach()
       bli_obj_scalar_apply_scalar()
       bli_obj_scalar_reset()
       bli_obj_scalar_has_nonzero_imag()
       bli_obj_scalar_equals()
    
    - Placed all bli_obj_scalar_* functions in a new file, bli_obj_scalar.c.
    - Renamed the following macros:
    
       bli_obj_scalar_buffer() -> bli_obj_buffer_for_1x1()
       bli_obj_is_scalar()     -> bli_obj_is_1x1()
    
    - Defined new macros to set and copy internal scalars between objects:
    
       bli_obj_set_internal_scalar()
       bli_obj_copy_internal_scalar()
    
    - In level-3 internal back-ends, added conditional blocks where alpha and
      beta are checked for non-unit-ness. Those values for alpha and beta are
      applied to the scalars attached to aliases of A/B/C, as appropriate,
      before being passed into the variant specified by the control tree.
    - In level-3 blocked variants, pass BLIS_ONE into subproblems instead of
      alpha and/or beta.
    - In level-3 macro-kernels, changed how scalars are obtained. Now, scalars
      attached to A and B are multiplied together to obtain alpha, while beta
      is obtained directly from C.
    - In level-3 front-ends, removed old function calls meant to provide
      future support for mixed domain/precision. These can be added back later
      once that functionality is given proper treatment. Also, removed the
      creation of copy-casts of alpha and beta since typecasting of scalars
      is now implicitly handled in the internal back-ends when alpha and
      beta are applied to the attached scalars.
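    
    The attached-scalar idea described above can be sketched roughly as
    follows. This is a simplified illustration, not the BLIS source: the
    obj_sketch_t struct and every function name below are hypothetical, and
    only real double-precision scalars are modeled.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for obj_t (real double only). */
typedef struct
{
    double *buffer;  /* matrix data (unused by this sketch)  */
    size_t  m, n;    /* matrix dimensions                    */
    double  scalar;  /* attached scalar, defaults to 1.0     */
} obj_sketch_t;

/* Initialize an object; the attached scalar starts out as 1.0. */
void obj_sketch_init(obj_sketch_t *o, double *buf, size_t m, size_t n)
{
    o->buffer = buf;
    o->m      = m;
    o->n      = n;
    o->scalar = 1.0;
}

/* Fold an additional scalar into the one already attached. */
void obj_sketch_apply_scalar(obj_sketch_t *o, double s)
{
    o->scalar *= s;
}

/* alpha for the macro-kernel: product of the scalars attached to A and B. */
double kernel_alpha(const obj_sketch_t *a, const obj_sketch_t *b)
{
    return a->scalar * b->scalar;
}

/* beta for the macro-kernel: taken directly from C. */
double kernel_beta(const obj_sketch_t *c)
{
    return c->scalar;
}
```

    In this sketch a back-end would apply a non-unit alpha to an alias of A
    (or B) and a non-unit beta to an alias of C before calling the variant.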

commit 992de486d6f23e69a623abd15ae77d7881d13871
Merge: 9552e6ee fd4ac636
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 2 13:58:46 2013 -0600

    Unimplemented kernels now call reference.
    
    Details:
    - Updated arm, bgq, loongson3a, and x86_64 kernels so that unimplemented
      datatypes call the corresponding reference kernel. Previously, these
      kernel functions called abort() with a "not yet implemented" error
      message.

commit fd4ac636d9a55cec1476a444bd4e70def219dc8f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 2 13:50:36 2013 -0600

    Unimplemented kernels now call reference.
    
    Details:
    - Updated micro-kernels for arm, bgq, loongson3a, and x86_64 so that
      unimplemented kernel functions simply call the corresponding reference
      implementation. (Previously, these unimplemented functions would
      abort() with a "not yet implemented" message.)

commit 9552e6ee824d4345d5e908e869e071d19829819a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Nov 24 11:40:31 2013 -0600

    Removed optional scaling from packm control tree.
    
    Details:
    - Removed does_scale field from packm control tree node and
      bli_packm_cntl_obj_create() interface. Adjusted all invocations of
      _cntl_obj_create() accordingly.
    - Redefined/renamed macros that are used in aliasing so that now,
      bli_obj_alias_to() does a full alias (shallow copy) while
      bli_obj_alias_for_packing() does a partial alias that preserves the
      pack_mem-related fields of the aliasing (destination) object.
    - Removed bli_trmm3_cntl.c, .h after realizing that the trmm control tree
      will work just fine for bli_trmm3().
    - Removed some commented vestiges of the typecasting functionality needed
      to support heterogeneous datatypes.

commit e65c476284db9ef64b23191a21c2584b1083342f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Nov 19 10:05:35 2013 -0600

    Minor updates to packm_blk_var2.c and _blk_var3.c.
    
    Details:
    - Comment updates to packm_blk_var2.c and packm_blk_var3.c.
    - In packm_blk_var2(), call setm_unb_var1(), scal2m_unb_var1() directly
      instead of setm(), scal2m().

commit 9e1d0d4bca48eda54301d8976f203e2544c9df3a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 18 18:11:07 2013 -0600

    Added trsm_l, trsm_u ukernels for x86_64/core2.
    
    Details:
    - Added standalone trsm_l/trsm_u micro-kernels for x86_64 (core2).
      These kernels are based on the gemmtrsm_l/gemmtrsm_u micro-kernels
      that already existed in kernels/x86_64/core2-sse3/3.

commit 85e7e02ea3a9190b6fcff5d46b00d41c79cb1242
Merge: 67761e22 70720054
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 18 12:02:00 2013 -0600

    Merge branch 'master'. Forgot to git-pull.

commit 67761e224c92500eecf9c1540cc72bdd2fb27679
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 18 11:57:40 2013 -0600

    Attempting to fix errors in bgq build.
    
    Details:
    - Removed restrict declaration from b_cast and c_cast from
      bli_trsm_lu_ker_var2.c and bli_trsm_rl_ker_var2.c. Curiously, they
      are causing problems for xlc only in those two files and no other
      macro-kernels.
    - Fixed (hopefully) kernel function parameter type declarations in
      kernels/bgq/1f/bli_axpyf_opt_var1.c and kernels/bgq/3/bli_gemm_8x8.c.

commit 707200541d344f98cf34c9801954dbb36fbe0447
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 18 11:17:31 2013 -0600

    Syntax error fix in x86_64/core2 gemmtrsm_u ukr.

commit bbe2b84a49e7785d4d0c514cda34adfbe66478b0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 18 11:11:06 2013 -0600

    Updated Makefile in test, testsuite.
    
    Details:
    - Updated Makefiles in test and testsuite directories to use the new
      BLIS header installation directory scheme, which is to compile with
      -I<PREFIX>/include/blis instead of -I<PREFIX>/include.

commit 9bd7fcfd436625ca2108128086671319362f4d92
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 18 10:58:09 2013 -0600

    Outer-to-inner 'restrict' fix in macro-kernels.
    
    Details:
    - Fixed sloppy placement of 'restrict' pointer declarations in level-3
      macro-kernels. Previously, all restricted pointers were being declared
      at the outer-most function scope level. While this violates the C99
      standard, very few of the compilers used with BLIS so far have seemed
      to care. The lone exception has been IBM's xlc. Thanks to Tyler Smith
      for identifying this bug (and suggesting the fix).
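    
    A minimal sketch of the inner-scope pattern, assuming a simple loop nest
    rather than an actual macro-kernel. The function add_panels() and its
    parameters are hypothetical; the point is only that each
    restrict-qualified pointer is declared in the block where it is used.

```c
#include <stddef.h>

/* Adds an m x n column-major panel x (leading dimension ldx) into
   y (leading dimension ldy). Hypothetical example, not BLIS code. */
void add_panels(size_t m, size_t n,
                const double *x, size_t ldx,
                double *y, size_t ldy)
{
    for (size_t j = 0; j < n; ++j)
    {
        /* Declared per iteration, inside the block in which the
           no-aliasing promise is meant to hold. */
        const double *restrict xj = x + j * ldx;
        double       *restrict yj = y + j * ldy;

        for (size_t i = 0; i < m; ++i)
            yj[i] += xj[i];
    }
}
```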

commit 50549a6a31dd26cf63a013e0ede16b2c7ce835b6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Nov 17 18:31:27 2013 -0600

    Changed header install directory to include/blis.
    
    Details:
    - Changed top-level Makefile so that headers are installed to
      $(INSTALL_PREFIX)/include/blis/. (Header directories are no longer
      named by version/configuration and then symlinked.)
    - Added uninstall targets, including uninstall-old to clean out old
      library archives.
    - Added GREP makefile definitions to all configurations' make_defs.mk.

commit d70733abddfb9a95661897e1e4f3c1f3cfa7cbaa
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Nov 16 17:34:25 2013 -0600

    Added ARM kernels, configurations.
    
    Details:
    - Added kernels for ARM, and configurations for Cortex-A9 and Cortex-A15.
      Thanks to Francisco Igual for contributing these kernels and
      configurations.

commit d37c2cff62089c86983c2f79762f4b5329037373
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 13 10:47:11 2013 -0600

    Minor comment and Makefile changes.
    
    Details:
    - Added missing 'check-config' and 'check-make-defs' targets to
      testsuite/Makefile.
    - Removed unused 'test' target from top-level Makefile.
    - Comment changes to testsuite input files.

commit 19885f893a17b91ee79bead0620d0f913392d4c5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 11 12:09:21 2013 -0600

    Updated some kernel comment headers.
    
    Details:
    - Updated bgq and piledriver comment headers to use BLIS copyright header
      instead of libflame.

commit 1a4d698f42981d74fe5f29b980031e1ee7dc42d5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 11 10:15:40 2013 -0600

    CHANGELOG update (for 0.1.0).

commit 089048d5895a30221b6b1976c9be93ad6443420d (tag: 0.1.0)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Nov 9 17:18:00 2013 -0600

    Added object wrappers to 1f test suite modules.
    
    Details:
    - Added missing object wrappers to level-1f test suite modules. This was
      only apparent if you were configuring with something other than the
      reference configuration.
    - Commented out object-wrappers in level-1f front-ends. These were not
      working as intended when the reference configuration was selected, because
      most kernel sets, such as those in the template set, do not have object
      wrappers.
    - Whitespace changes to template micro-kernels.
    - Comment changes to template level-1f kernel headers.

commit 9ef3752079de10124bed906b5d28479d04aa8187
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Nov 8 17:20:47 2013 -0600

    Updated template kernels wrt KernelsHowTo wiki.
    
    Details:
    - Merged latest state of KernelsHowTo wiki into template micro-kernels
      located in config/template/kernels/3.

commit 376bbb59c8944e29c5c1ff6637920d8451370afa
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Nov 8 11:17:34 2013 -0600

    Removed support for duplication.
    
    Details:
    - Removed support for duplication from the gemmtrsm/trsm micro-kernels
      and all framework code.
    - Updated test suite modules according to above changes.

commit 68a5910974b62b4df853fae2a68cb04df9d5a19c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Nov 7 11:36:11 2013 -0600

    Added comments to testsuite/input.operations.
    
    Details:
    - Added extensive comments to the top of testsuite/input.operations,
      which describe how to edit the file.
    - Removed input.operations.0 and input.operations.1.
    - Changed input.general to test all datatypes ("sdcz") by default.

commit a98f78b715fb256a519870071bb5266130d70b21
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 6 15:32:47 2013 -0600

    Changed dim_t and inc_t to be signed integers.
    
    Details:
    - Redefined dim_t and inc_t in terms of gint_t (instead of guint_t).
      This will facilitate interoperability with Fortran in the future.
      (Fortran does not support unsigned integers.)
    - Redefined many instances of stride-related macros so that they return
      or use the absolute value of the strides, rather than the raw strides
      which may now be signed. Added new macros bli_is_row_stored_f() and
      bli_is_col_stored_f(), which assume positive (forward-oriented) strides,
      and changed the packm_blk_var[23] variants to use these macros instead
      of the existing bli_is_row_stored(), bli_is_col_stored().
    - Added/adjusted typecasting to various functions/macros, including
      bli_obj_alloc_buffer(), bli_obj_buffer_at_off(), and various pointer-
      related macros in bli_param_macro_defs.h.
    - Redefined bli_convert_blas_incv() macro so that the BLAS compatibility
      layer properly handles situations where vector increments are negative.
      Thanks to Vladimir Sukharev for pointing out this issue.
    - Changed type of increment parameters in bli_adjust_strides() from dim_t
      to inc_t. Likewise in bli_check_matrix_strides().
    - Defined bli_check_matrix_object(), which checks for negative strides.
    - Redefined bli_check_scalar_object() and bli_check_vector_object() so
      that they also check for negative stride.
    - Added instances of bli_check_matrix_object() to various operations'
      _check routines.
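    
    For context, the negative-increment case can be sketched as below. This
    assumes the reference-BLAS convention, in which a vector with a negative
    increment is traversed backward starting from the element at offset
    (n-1)*|incx|. The helpers first_element() and gather() are hypothetical
    and are not the bli_convert_blas_incv() macro itself.

```c
#include <stddef.h>

/* Returns a pointer to the logical first element of a BLAS-style vector.
   For incx < 0 the logical first element lives at the far end of the
   buffer, (n-1)*|incx| elements past x. */
const double *first_element(const double *x, long n, long incx)
{
    if (incx < 0 && n > 0)
        return x + (n - 1) * (-incx);
    return x;
}

/* Copies the n logical elements of x into the contiguous array out. */
void gather(long n, const double *x, long incx, double *out)
{
    const double *p = first_element(x, n, incx);

    /* With incx < 0, p[i * incx] walks backward through the buffer. */
    for (long i = 0; i < n; ++i)
        out[i] = p[i * incx];
}
```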

commit 1f8afc3e08a4312cfe810be86aedeacbc57275c5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 6 10:09:10 2013 -0600

    Minor comment update to BLAS compat files.

commit 1abbf768afafc158d44e4d5c4a135cfd9e277f13
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 4 15:50:00 2013 -0600

    Fixed bugs in scalv and setv.
    
    Details:
    - Fixed bugs similar to those addressed in cca1e1f51dc6, whereby
      a segmentation fault may occur if beta is not the same type as
      the vector operand for scalv and setv.
    - Changed axpyv and scal2v front-ends in a similar fashion.

commit f5953259a1842ee48e5833c22ac86e68a337bfe1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 4 14:43:55 2013 -0600

    Fixed a bug related to Hermitian matrix diagonals.
    
    Details:
    - Fixed a bug whereby BLIS assumed that the imaginary components of the
      diagonal elements of Hermitian matrices were already zero. This property
      is now enforced when the matrix is packed (bli_packm_blk_var2). Thanks
      to Vladimir Sukharev for reporting this bug.
    - Minor comment updates to template kernels.

commit d70f2b089dac8b9e4c19295dfa6014c36afee2ec
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Nov 2 17:19:40 2013 -0500

    Added scaling to abval2s, sqrt2s macros.
    
    Details:
    - Re-defined abval2s and sqrt2s macros to use scaling to avoid underflow
      and overflow from squaring the real and imaginary components. (This is
      the same technique used to fix recent bugs in invscals/invscaljs and
      inverts.)

commit c5b1ed9409ae2f71d04041eef5da9a0080b5784a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Nov 1 10:28:04 2013 -0500

    Added new dotxaxpyf variant 2.
    
    Details:
    - Added a new variant for dotxaxpyf that is based on dotxf and axpyf
      kernels. By default, this variant is not used by any other operation.

commit 97f89fbcf202d72fc440b614708e352ea31633e2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Nov 1 10:16:39 2013 -0500

    Fixed bug in complex invscals.
    
    Details:
    - Fixed complex inversion in invscals and invscaljs whereby the
      imaginary component was being computed incorrectly.
    - Use bli_fmaxabs() instead of bli_fabs() when choosing the scalar
      in inverts, invscals, and invscaljs.
    - Changed bli_abs() and bli_fabs() macro definitions to use "<="
      operator instead of "<".

commit eda42a21d17a2742eab69ab801ed530b82488c8a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 31 18:00:44 2013 -0500

    Defined missing symbols in bla_rotg.c
    
    Details:
    - Defined local equivalents of libf2c's r_sign(), d_sign(), c_abs(), and
      z_abs(), which are needed by bla_rotg.c. Also defined r_abs() and
      d_abs() for completeness. Thanks to Vladimir Sukharev for reporting
      these bugs.

commit cca1e1f51dc67a2c3725d5c1837256831aaf70f8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 30 14:39:01 2013 -0500

    Fixed bugs in scalm and setm.
    
    Details:
    - Fixed bugs in scalm and setm that resulted in segmentation faults when
      beta is not the same type as the matrix operand. Thanks to Vladimir
      Sukharev for reporting this bug.
    - Changed axpym and scal2m front-ends in a fashion similar to that of scalm
      and setm; namely, the alpha scalar is copy-cast to the type of the first
      matrix operand.
    - Changed the template and reference configurations' bli_config.h files
      so that the number of memory allocator blocks of A and B are set based
      on BLIS_MAX_NUM_THREADS.
    - Comment updates to bli_obj.c and variable rename in bla_nrm2.c.

commit 2807013a4761c2b84b3944de64d23483ad7ef2fb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 24 14:32:20 2013 -0500

    Fixed over/under-flow in complex inversion.
    
    Details:
    - Fixed the complex bli_?inverts() macros, which were inverting elements
      in an "unsafe" manner, such that very large and very small values were
      unnecessarily over/under-flowing. Thanks to Vladimir Sukharev for
      reporting this bug.
    - Comment update to bli_sumsqv_unb_var1.c.
    - Removed redundant bli_min() macro in bli_scalar_macro_defs.h.
    - Changed 1.0F to 1.0 for bli_drands() macro.
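    
    One common way to make complex inversion "safe" is to scale by the larger
    of the two component magnitudes before squaring anything. The sketch
    below illustrates that general technique; it is an assumption that the
    bli_?inverts() macros follow this exact form, and sk_abs() and
    sk_invert() are hypothetical names.

```c
/* Hypothetical helper: magnitude of a real value. */
double sk_abs(double v)
{
    return v < 0.0 ? -v : v;
}

/* Replaces (re, im) with the components of 1 / (re + i*im).
   A zero input is not handled, since 1/0 is undefined. */
void sk_invert(double *re, double *im)
{
    double a = *re;
    double b = *im;

    /* Scale by the larger magnitude so that neither a*a nor b*b
       is ever formed at full scale. */
    double s  = sk_abs(a) > sk_abs(b) ? sk_abs(a) : sk_abs(b);
    double ar = a / s;
    double br = b / s;

    /* denom equals (a*a + b*b) / s, computed without overflow. */
    double denom = ar * a + br * b;

    *re =  ar / denom;
    *im = -br / denom;
}
```

    With a = b = 1e200, the naive formula a / (a*a + b*b) overflows in the
    denominator, whereas the scaled form above keeps every intermediate
    value finite.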

commit 45a80c625f84edb2ade6ac25efe2b9c589d7e0df
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 23 12:15:25 2013 -0500

    Fixed parameter checking issue in BLAS syr[2]k.
    
    Details:
    - Fixed a minor parameter checking bug in the BLAS compatibility layer
      for [sd]syrk and [sd]syr2k. Specifically, if 'C' is passed in for the
      trans parameter of either operation, it is (a) allowed, and (b) treated
      as 'T' (whereas previously it was disallowed). Thanks to Vladimir
      Sukharev for finding and reporting this bug.
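    
    The accepted values can be sketched as a small mapping function. This is
    an illustration only: real_syrk_trans() is a hypothetical name, and the
    case-insensitive comparison is an assumption based on how the reference
    BLAS compares its character arguments.

```c
#include <ctype.h>

/* Maps the trans argument of a real syrk/syr2k call to 'N' or 'T'.
   Returns 0 for any value that should be rejected. */
char real_syrk_trans(char trans)
{
    switch (toupper((unsigned char)trans))
    {
        case 'N': return 'N';
        case 'T': return 'T';
        case 'C': return 'T';  /* allowed, and treated as 'T' for real types */
        default:  return 0;
    }
}
```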

commit a091a219bda55e56817acd4930c2aa4472e53ba5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 14 10:11:29 2013 -0500

    Minor fixes to piledriver configuration, ukernel.
    
    Details:
    - Applied a patch from Tyler that fixes minor staleness in the piledriver
      configuration and gemm micro-kernel.
    - Very minor changes to test suite input files.

commit dacdde27aee4fb90b14880136d7f20c6b234e2c6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 11 11:37:19 2013 -0500

    Added Fran's Sandy Bridge kernels/configuration.
    
    Details:
    - Added a kernel directory for kernels developed by Francisco Igual for
      the Sandy Bridge architecture, including a dgemm ukernel coded with
      AVX intrinsics.
    - Added a configuration for Sandy Bridge using values supplied by Fran.

commit 03106d650e4030d4c9831683448376f92fc52d41
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 11 10:40:38 2013 -0500

    Fixed minor perf bug in gemm_ker_var2.
    
    Details:
    - Fixed a minor performance bug in bli_gemm_ker_var2.c (and the experimental
      bli_gemm_ker_var5.c) whereby the addresses for a_next and b_next are not
      computed correctly (i.e., do not wrap around) at the edge cases. Thanks to
      Tze Meng for helping me identify this bug.
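    
    A rough sketch of what "wrapping around" means for a next-panel address.
    The function next_panel() and its parameters are hypothetical, and the
    actual macro-kernel logic for choosing the next panel may differ; the
    sketch only shows the last iteration pointing back to the first panel
    instead of one panel past the end.

```c
#include <stddef.h>

/* Returns the address of the micro-panel that follows panel i, where
   n_iter panels are stored panel_stride elements apart starting at base.
   On the last iteration the address wraps around to the first panel. */
const double *next_panel(const double *base, size_t i,
                         size_t n_iter, size_t panel_stride)
{
    size_t next = (i + 1 == n_iter) ? 0 : i + 1;
    return base + next * panel_stride;
}
```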

commit b053337387dbdef9035be03538222670a21707ca
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 10 18:26:55 2013 -0500

    Added fusing factors, MR/NR to test suite output.
    
    Details:
    - Updated the test suite driver (and modules where appropriate) so that
      the level-1f fusing factors are output along with the variable dimension.
      While this is not strictly necessary, since the fusing factors are output
      in the initial parameter summary, it allows extra reassurance to the user
      since the fusing factors appear alongside the variable dimension, which
      together give a complete picture of the problem size. Similar changes were
      made for outputting the register blocksizes when reporting results for the
      micro-kernel test modules.

commit be4833bd91c5a58d0bfc52daaadf7ba543a77acf
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 10 14:20:06 2013 -0500

    Added test suite modules for level-1f, 3 kernels.
    
    Details:
    - Added test modules in test suite for level-1f kernels and level-3
      micro-kernels. (Duplication in the micro-kernels, for now, is NOT
      supported by these test modules.)
    - Added section override switches to test suite's input.operations file.
    - Added obj_t APIs for level-1f front-ends and their unblocked variants to
      facilitate the level-1f test modules. Also added front-end for dupl
      operation.
    - Added obj_t-based check routines for level-1f operations, which are
      called from the new front-ends mentioned above.
    - Added query routines for axpyf, dotxf, and dotxaxpyf that return fusing
      factors as a function of datatype, which is needed by their respective
      test modules.
    - Whitespace changes to bli_kernel.h of all existing configurations.

commit 680188d46bb15b9a1a2867638104939dc77ca2a1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 10 13:23:37 2013 -0500

    Cleaned up old test drivers.
    
    Details:
    - Minor updates to old test drivers in preparation for our participation
      in ACM TOMS's replicated results initiative.

commit 3690bdd4f95769c935c410414112102cc3e108b1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 10 11:45:33 2013 -0500

    More updates to level-1f kernels for core2-sse3.
    
    Details:
    - Changed types in function signatures to match new prototypes. Meant to
      include this in previous commit.

commit 661d5120cd7071f9b0c5cefc95f99f1361370ade
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 10 11:27:27 2013 -0500

    Fixed outdated fusing factor macros in 1f kernels.
    
    Details:
    - Updated level-1f kernels for x86_64 and bgq to use renamed fusing factor
      macros. Meant to include this in 5e54f46c. Thanks to Fran for pointing
      this out.

commit 73aa1e9f31d1b2a319c7e711ced6db3f9835c832
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Oct 1 17:01:18 2013 -0500

    Added section overrides to test suite.
    
    Details:
    - Added new lines of input to the test suite's input.operations file, which
      allows the user to disable entire sections (levels) of tests. Before this
      change, the user had to manually disable each operation test's "master
      switch". (This is why input.operations.0 existed: to allow a more
      convenient starting point for someone who only wanted to test one or a
      few operations.)

commit 5e54f46ccb76beab892d530b693e07c6bf6db7cf
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 30 12:58:18 2013 -0500

    Added template implementations and other tweaks.
    
    Details:
    - Added a 'template' configuration, which contains stub implementations of the
      level 1, 1f, and 3 kernels with one datatype implemented in C for each, with
      lots of in-file comments and documentation.
    - Modified some variable/parameter names for some 1/1f operations. (e.g.
      renaming vector length parameter from m to n.)
    - Moved level-1f fusing factors from axpyf, dotxf, and dotxaxpyf header files
      to bli_kernel.h.
    - Modified test suite to print out fusing factors for axpyf, dotxf, and
      dotxaxpyf, as well as the default fusing factor (which are all equal
      in the reference and template implementations).
    - Cleaned up some sloppiness in the level-1f unb_var1.c files whereby these
      reference variants were implemented in terms of front-end routines rather
      than directly in terms of the kernels. (For example, axpy2v was implemented
      as two calls to axpyv rather than two calls to AXPYV_KERNEL.)
    - Changed the interface to dotxf so that it matches that of axpyf, in that
      A is assumed to be m x b_n in both cases, and for dotxf A is actually used
      as A^T.
    - Minor variable naming and comment changes to reference micro-kernels in
      frame/3/gemm/ukernels and frame/3/trsm/ukernels.

commit 97aaf220a847363b4da35935eca17790c0ef71f6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 17 10:51:36 2013 -0500

    Added new kernels, configurations.
    
    Details:
    - Added various micro-kernels for the following architectures:
        Intel MIC
        IBM BG/Q
        IBM Power7
        AMD Piledriver
        Loongson 3A
      and reorganized kernels directory. Thanks to Tyler Smith, Mike Kistler,
      and Xianyi Zhang for contributing these kernels.
    - Added configurations corresponding to above architectures, and renamed
      "clarksville" configuration to "dunnington".

commit fe979c5a114c877506a5697cdab1fc8cf2bcd303
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Sep 13 14:31:53 2013 -0500

    Removed default configuration behavior.
    
    Details:
    - Changed the configure script so that it no longer defaults to the
      reference configuration. This change is being made so that the
      developer has a firm awareness of which configuration is being used
      to configure BLIS. Thanks to Mike Kistler and Bryan Marker for this
      suggested change.

commit da77e9614f54f92f703f01e3b9bd67a83280150c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Sep 13 12:00:37 2013 -0500

    Minor improvements to static memory allocator.
    
    Details:
    - Expanded on cpp macro definitions from bli_mem.c and relocated them to
      a new header file, frame/include/bli_mem_pool_macro_defs.h. The expanded
      functionality includes computing the pool size for each datatype (using
      that datatype's cache blocksizes) and using the maximum to size the
      actual pool array. This addresses the somewhat common pitfall whereby a
      developer updates cache blocksizes in bli_kernel.h for only one datatype
      (say, single-precision real), while the memory pools are sized using the
      double-precision real values. Then, when the developer attempts to link
      to and run a level-3 BLIS routine (e.g. dgemm), the library aborts with
      a message saying the static memory pool was exhausted. Clearly, this
      message is misleading when the pool was not sized properly to begin with.
    - Removed previously disabled code in bli_kernel_macro_defs.h that was
      meant to check for size consistency among the various cache blocksizes.
      (Obviously the memory pool size-based solution mentioned above is better.)
    - Added BLIS_SIZEOF_? cpp macros to bli_type_defs.h. This seemed like a
      reasonable place to put these constants, rather than further crowd up
      bli_config.h.
    - Updated testsuite driver to output memory pool sizes for A, B, and C.
    - Minor comment updates to bli_config.h.
    - Removed 'flame' configuration. It was beginning to get out-of-date, and
      I hadn't used it in months. We can always re-create it later.
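    
    The max-over-datatypes sizing described in the first item can be sketched
    with a few cpp macros. All names and blocksize values below are
    hypothetical, and only the single- and double-precision real pools for A
    are modeled; this is not the content of bli_mem_pool_macro_defs.h.

```c
#include <stddef.h>

/* Hypothetical cache blocksizes for two datatypes. */
#define SK_MC_S 512
#define SK_KC_S 256
#define SK_MC_D 128
#define SK_KC_D 256

/* Pool size (in bytes) needed for a packed block of A, per datatype. */
#define SK_POOL_SIZE_A_S (SK_MC_S * SK_KC_S * sizeof(float))
#define SK_POOL_SIZE_A_D (SK_MC_D * SK_KC_D * sizeof(double))

#define SK_MAX(a, b) ((a) > (b) ? (a) : (b))

/* The pool is sized by the largest requirement across datatypes, so
   changing the blocksizes of just one datatype cannot undersize it. */
#define SK_POOL_SIZE_A SK_MAX(SK_POOL_SIZE_A_S, SK_POOL_SIZE_A_D)

char sk_pool_a[SK_POOL_SIZE_A];

/* Reports the number of bytes actually reserved for the pool. */
size_t sk_pool_a_size(void)
{
    return sizeof(sk_pool_a);
}
```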

commit 631f347b7a99cb02757c534fd3ec5f723a2fdb0e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 10 17:17:28 2013 -0500

    Added ESSL and Accelerate targets to test drivers.
    
    Details:
    - Added ESSL and Accelerate (OS X) targets to standalone test drivers'
      Makefile in "test" directory. Thanks to Jeff Hammond for suggesting
      / providing this patch.

commit 7ae4d7a41d13ef5f1ceee217c000a5cf77a11128
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 10 16:35:12 2013 -0500

    Various changes to treatment of integers.
    
    Details:
    - Added a new cpp macro in bli_config.h, BLIS_INT_TYPE_SIZE, which can be
      assigned values of 32, 64, or some other value. The former two result in
      defining gint_t/guint_t in terms of 32- or 64-bit integers, while the latter
      causes integers to be defined in terms of a default type (e.g. long int).
    - Updated bli_config.h in reference and clarksville configurations according
      to above changes.
    - Updated test drivers in test and testsuite to avoid type warnings associated
      with format specifiers not matching the types of their arguments to printf()
      and scanf().
    - Inserted missing #include "bli_system.h" into blis.h (which was slated for
      inclusion in d141f9eeb6d1).
    - Added explicit typecasting of dim_t and inc_t to macros in
      bli_blas_macro_defs.h (which are used in BLAS compatibility layer).
    - Slight changes to CREDITS and INSTALL files.
    - Slight tweaks to Windows build system, mostly in the form of switching to
      Windows-style CRLF newlines for certain files.
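    
    The integer-size selection described in the first item can be sketched as
    follows. The SK_-prefixed macro and the sk_ typedef names are
    hypothetical stand-ins for BLIS_INT_TYPE_SIZE, gint_t, and guint_t, and
    the actual definitions in BLIS may be organized differently.

```c
#include <stdint.h>

/* Hypothetical stand-in for BLIS_INT_TYPE_SIZE. */
#ifndef SK_INT_TYPE_SIZE
#define SK_INT_TYPE_SIZE 64
#endif

#if SK_INT_TYPE_SIZE == 32
/* Explicit 32-bit integers. */
typedef int32_t  sk_gint_t;
typedef uint32_t sk_guint_t;
#elif SK_INT_TYPE_SIZE == 64
/* Explicit 64-bit integers. */
typedef int64_t  sk_gint_t;
typedef uint64_t sk_guint_t;
#else
/* Any other value falls back to a default C type. */
typedef long int          sk_gint_t;
typedef unsigned long int sk_guint_t;
#endif
```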

commit 068437736b41d51a1f5ec47839f059bf58a20413
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 9 14:07:58 2013 -0500

    Fixed set-but-not-used compiler (gcc) warnings.
    
    Details:
    - Used void-casts of certain variables to appease gcc (and perhaps other
      compilers) when such variables are only used in the complex instances of
      the functions. Special thanks to Karl Rupp for suggesting a portable fix
      for these warnings.
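    
    A minimal sketch of the void-cast idiom, assuming a function body shared
    between real and complex instances. The function scale_real() is
    hypothetical and only mimics the shape of the problem.

```c
/* Real instance of a function whose body is shared with complex
   instances. The imaginary part is set but never read here. */
double scale_real(double alpha_r, double alpha_i, double x)
{
    double ai = alpha_i;  /* only read in the complex instances */

    /* The void-cast counts as a use, which silences gcc's
       set-but-not-used warning without changing behavior. */
    (void)ai;

    return alpha_r * x;
}
```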

commit 6dc85f63dcd5282340c9e00d585e97d70a21edc3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 9 13:48:52 2013 -0500

    Small fix to Windows defs.mk makefile fragment.
    
    Details:
    - Commented out a !include statement that was attempting to include a
      version file that does not yet exist. For now, the version string is
      hard-coded into defs.mk.

commit d141f9eeb6d1de7044b7429adf52d11c6fca620c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 9 13:09:16 2013 -0500

    Added Windows build system.
    
    Details:
    - Added a 'windows' directory, which contains a Windows build system
      similar to that of libflame's. Thanks to Martin for getting this up
      and running.
    - Spun off system header #includes into bli_system.h, which is included
      in blis.h.
    - Added a Windows section to bli_clock.c (similar to libflame's).

commit 9b320e7406fb69e8b61a0085abe2ed89a96bdb68
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 9 11:04:46 2013 -0500

    Edited bli_?lamch.c to avoid Windows keyword.
    
    Details:
    - Renamed "small" variable to "smnum" to avoid collision with Windows type
      by the same name. This change is needed in advance of the upcoming Windows
      build system.

commit 9013ad6ff2e9ace35e0cf44c32795c2f3d5be628
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Sep 4 13:36:07 2013 -0500

    Switched integer typedefs (again) to C types.
    
    Details:
    - Redefined gint_t and guint_t in terms of the standard C types long int
      and unsigned long int, respectively.
    - Changed testsuite default max problem size to 500.
    - Changed testsuite input.operations to use square problems for level-3
      operation tests.

commit 981a60cfa07abac2e93697dfe12b0f076ab00a38
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Sep 4 12:09:11 2013 -0500

    Falling back to 32-bit integers for dim_t, etc.
    
    Details:
    - In light of recent segfaulting issues when compiling on 32-bit systems,
      I've changed the default typedef for gint_t and guint_t from int64_t and
      uint64_t to int32_t and uint32_t, respectively.
    - Disabled 64-bit integers in the blas2blis layer for the reference
      configuration.
    - Added type sizes of gint_t, guint_t, and the four floating-point datatypes
      to introductory output of the testsuite.

commit b776ddcd4338b34f172ef78da0ac1d771a771ab4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 3 21:58:07 2013 -0500

    Applied temp fix to typecasting bug in testsuite.
    
    Details:
    - Applied a temporary fix to the typecasting bug in the testsuite driver.
      The fix involves casting both numerator and denominator to unsigned long.
      This fix is more voodoo than science, as I can't be sure why it even
      works.

commit 9ee6e125373869c4213c017ce772c38ecefba103
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 3 21:53:27 2013 -0500

    Changed dimension spec for gemm in testsuite.
    
    Details:
    - Encountered a bizarre typecasting bug whereby the test suite was not
      computing the proper dimension from the problem size and dimension
      specification when the latter was set to -3. Will investigate.
      Thanks to Fran for finding this "bug".

commit e8be081e68c385ab44d0fea8dade21d40c200b79
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 28 15:52:34 2013 -0500

    Generalized matlab and file output in testsuite.
    
    Details:
    - Added a new option in input.general that allows outputting in
      matlab/octave format so that one can output in matlab format
      independently from outputting to files.
    - Adjusted input.operations according to above.
    - Added input.operations.0 and input.operations.1 with all options
      disabled and enabled, respectively.

commit d352c746e5683037d41b5061dfb5ce08e1d0843b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 27 13:41:46 2013 -0500

    Added single/real gemm micro-kernel for x86_64.
    
    Details:
    - Added a single-precision real gemm micro-kernel in
      kernels/x86_64/3/bli_gemm_opt_d4x4.c.
    - Adjusted the single-precision real register blocksizes in
      config/clarksville/bli_kernel.h to be 8x4.
    - Added a missing comment to bli_packm_blk_var2.c that was present in
      bli_packm_blk_var3.c

commit dedda523dc5dc779ecc34e6a03dc74cb8eb220de
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Aug 19 12:07:41 2013 -0500

    Fixed bug in bli_acquire_mpart_t2b(), _l2r().
    
    Details:
    - Fixed a bug in bli_acquire_mpart_t2b() and bli_acquire_mpart_l2r()
      that caused incorrect partitioning when SUBPART0 was requested. This
      bug was introduced in 46d3d09d49ad. Thanks to Bryan for isolating
      this bug.
    - Removed dupl kernels from kernels/x86_64/3 directory.
    - Uncommented beta == 0 optimization code in
      kernels/x86_64/3/bli_gemm_opt_d4x4.c.

commit 12dbd2f33455e9384fe2070cbdd660fd4a7fceb5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 8 14:39:35 2013 -0500

    Moved init_safe(), finalize_safe() to BLAS compat.
    
    Details:
    - Moved the bli_init_safe() and bli_finalize_safe() function calls from the
      BLAS-like BLIS layer to the BLAS compatibility layer. Having these auto-
      initializers in the BLIS layer wasn't buying us anything because the user
      could still call the library with uninitialized global scalar constants,
      for example. Thus, we will just have to live with the constraint that
      bli_init() MUST be called before calling ANY routine with a bli_ prefix.
    - Added the missing _init_safe() and _finalize_safe() calls to the level-1
      BLAS compatibility wrappers.

commit 8abfe55f2ae5d89df18e1b26a5a28d94b0936683
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 8 13:30:19 2013 -0500

    Miscellaneous updates.
    
    Details:
    - Changed the BLIS_HEAP_STRIDE_ALIGN_SIZE in the configurations from 16 to
      BLIS_CACHE_LINE_SIZE (typically 64).
    - Changed the use of nr in sizing of bd buffer to packnr in level-3 macro-
      kernels.
    - Reformulated gemm_ker_var2 to look more like the other level-3 macro-
      kernels, in that the interior and edge-case handling is expressed once
      inside the loops in the n and m dimensions, rather than the edge-case
      handling being "unrolled" and expressed as distinct code regions. The
      previous macro-kernel now lives in retired form in the subdirectory
      other/bli_gemm_ker_var2.c.old.
    - Updated experimental gemm_ker_var5 according to above change.
    - Fixed bug in bli_her2k.c whereby incorrect transformations were being
      applied to optimize the macro-kernel's access pattern on C when C is
      row-stored.
    - Various updates inside of test/exec_sizes.

commit 1aa05736ff49e7cc5f121acf615460fe9a87852c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 7 12:27:04 2013 -0500

    Fixed bug in interface of bla_ger_check().
    
    Details:
    - Fixed the misplaced lda parameter in the function signature of
      bla_ger_check(). Thanks to Tyler for finding this bug.

commit 685aad25353fb200de4ca97a8bc0feeebde51d0f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 6 12:25:51 2013 -0500

    Fixed cpp guard typos in frame/compat/check files.
    
    Details:
    - Fixed instances of BLIS_ENABLE_BLIS2BLAS that should have been
      BLIS_ENABLE_BLAS2BLIS. Thanks to Tyler for catching this.
    - Fixed various syntax errors in the code that had yet to be compiled
      due to the aforementioned bug.

commit f4ec28e723d28d998f1038f82da6986e44320ef6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 1 11:24:23 2013 -0500

    Added basic OpenMP-based gemm and packm files.
    
    Details:
    - Integrated Tyler's parallelized packm_blk_var2 and gemm_ker_var2
      into the following auxiliary files
    
        frame/1m/packm/other/bli_packm_blk_var2.c
        frame/3/gemm/other/bli_gemm_ker_var2.c
    
      The routine in the first file uses a basic OpenMP parallel region to
      parallelize the packing of blocks of A and panels of B, while the
      second uses a similar parallel region to parallelize along the n
      dimension of the gemm macro-kernel.

commit f8980edf9c318453bb1962ac4939c06bf11e6d5e
Merge: 67a8b949 6e7e4523
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 26 11:14:27 2013 -0500

    Merge branch 'master' of https://code.google.com/p/blis

commit 67a8b9498d13b038deb316ac163e62c5b17da2ec
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 26 11:12:37 2013 -0500

    Added missing cpp kernel blocksize constraints.
    
    Details:
    - Added missing C preprocessor guards in bli_kernel_macro_defs.h that enforce
      constraints on the register blocksizes relative to the cache blocksizes.
      Thanks to Tyler for helping me stumble across this issue.
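
    A minimal sketch of what such preprocessor guards might look like. The
    macro names and values are hypothetical (they are not taken from
    bli_kernel_macro_defs.h), and the "whole multiple" constraint is an
    assumption based on the later note that MC and NC must be multiples of
    the register blocksizes.

```c
/* Hypothetical blocksize values for a single datatype. */
#define EX_MC 256   /* cache blocksize in the m dimension    */
#define EX_NC 4096  /* cache blocksize in the n dimension    */
#define EX_MR 8     /* register blocksize in the m dimension */
#define EX_NR 4     /* register blocksize in the n dimension */

/* Fail at compile time if a cache blocksize is not a whole multiple
   of the corresponding register blocksize. */
#if ( EX_MC % EX_MR ) != 0
  #error "MC must be a whole multiple of MR."
#endif

#if ( EX_NC % EX_NR ) != 0
  #error "NC must be a whole multiple of NR."
#endif
```

    Changing EX_MR to 7 in this sketch stops compilation at the first
    #error, so the misconfiguration is caught before any kernel runs.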

commit 6e7e452343014e8f86640874dc1dbadca4a642a1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 22 14:50:57 2013 -0500

    Fixed minor warnings and misc issues.
    
    Details:
    - Fixed various warnings output by gcc 4.6.3-1, including removing some
      set-but-not-used variables and addressing some instances of typecasting
      of pointer types to integer types of different sizes.

commit 03f6c3599743bc837a7d40eb5b415b1bf4f2a4e9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 22 12:54:32 2013 -0500

    Tightened some macros that detect datatypes.
    
    Details:
    - Modified the definitions of some macros, such as bli_is_real(), so that
      the "special" bit is taken into account and BLIS_INT is differentiated
      from BLIS_FLOAT.
    - Whitespace changes to bli_obj_macro_defs.h.
    - Removed BLIS_SPECIAL_BIT definition from bli_type_defs.h, since it wasn't
      being used.

commit b33e2f4443b9043b554963320280ff7783773652
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 19 17:15:03 2013 -0500

    CHANGELOG update (for 0.0.9).

commit 0680916fdd532f7a4716b11a2515243b2c08d00f (tag: 0.0.9)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 18 18:04:34 2013 -0500

    Added BLAS error checking to compatibility layer.
    
    Details:
    - Added frame/compat/check directory, which now houses companion _check()
      routines for each of the BLAS wrappers in frame/compat. These _check()
      routines are called from the compatibility wrappers and mimic the
      error-checking present in the netlib BLAS.
    - Edited bla_xerbla.c so that xerbla() translates the operation string to
      uppercase before printing.
    - Redefined util routines in frame/compat/f2c/util in terms of level0
      macros.
    - Added prototypes for util routines, f2c routines, lsame(), and xerbla().
    - Commented out prototypes in test/test_*.c since Fortran integers are now
      int64_t by default (and the prototypes that were present in the files
      used int).
    - Removed redundant #include "bli_f2c.h" in bli_?lamch.c and bli_lsame.c,
      since blis.h was already being included.
    - Other minor changes to code in frame/compat/f2c.

commit 4e80ad28c97273db3366428ec44020da7944964d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 18 17:53:31 2013 -0500

    Added support for C99 complex types/arithmetic.
    
    Details:
    - Added support for C99 complex types to bli_type_defs.h and overloaded
      complex arithmetic to the scalar-level macros in include/level0. This
      includes a somewhat substantial reorganization and re-layering of much
      of the existing machinery present in the level0 macros.
    - Added new #define for BLIS_ENABLE_C99_COMPLEX to bli_config.h files,
      commented-out by default, which optionally enables the use of built-in
      C99 complex types and arithmetic.
    - Minor changes to clarksville and reference configs' make_defs.mk files.
    - Removed a macro definition from bli_param_macro_defs.h that was not being
      used (bli_proj_dt_to_real_if_imag_eq0).

commit 6072d7c848e837ba20d607f7b727438ada31bdcf
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 17 12:27:45 2013 -0500

    Fixed bugs in trsm, trmm macro-kernels.
    
    Details:
    - Fixed a bug in trsm_rl_ker_var2() caused by incorrect edge case handling.
    - Fixed a bug in trsm_rl_ker_var2() and trsm_ru_ker_var2() whereby k was
      incorrectly being adjusted upward by MR, instead of NR. The rl and ru
      trmm macro-kernels were updated in a similar fashion.
    - Fixed a bug in trsm_ru_ker_var2() that was due to a missing negation on
      diagoffb when recomputing k to skip a zero region below where the
      diagonal intersects the right side of the block. The corresponding
      trmm macro-kernel was also updated.
    - Fixed a bug in trsm_ru_ker_var2() where the adjustment of k (by NR)
      needed to be placed AFTER the block that recomputes k to skip the zero
      region (if present). The other three trsm macro-kernels, as well as the
      trmm macro-kernels, were updated in the same manner, for consistency.
    - Fixed a bug in trmm_lu_ker_var2() in which the wrong dimension (n) was
      being updated to skip a zero region to the left of where the diagonal
      of A intersects the top edge of the block.
    - Comment updates to all trsm and trmm macro-kernels.
    - Comment updates to bli_packm_init.c.

commit 47410a48f9b91e94ce4c67633686ffd1f2ad0275
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 10 14:53:59 2013 -0500

    Added f2c'ed Givens rotation wrappers.
    
    Details:
    - Retired (for now) existing ?rot*() BLAS compatibility wrappers to 'attic'
      along with other wrappers for which no BLIS implementation exists.
    - Added f2c-generated codes for applicable datatype flavors of rot, rotg,
      rotm, and rotmg operations.

commit e5f90f3a8dbe671104bcb9d8b4e3409de01805da
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 10 13:40:12 2013 -0500

    Removed copynz defs from bli_kernel.h files.
    
    Details:
    - Removed COPYNZ_KERNEL definition from the bli_kernel.h files in each
      configuration. (Meant to include this in previous commit.)

commit aec12d90f596e8c04b1ad178258a1cd38108f59d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 10 13:33:30 2013 -0500

    Removed copynzv, copynzm and related codes.
    
    Details:
    - Removed copynzv and copynzm operation directories. These operations
      implemented a variation of copyv/m that, in the case of real source
      and complex destination operands, leaves the imaginary component
      untouched (rather than setting it to zero). I realize now that the
      special case(s) (e.g. gemm with real A and B but complex C) that I
      thought required this operation actually can be handled more simply.
    - Removed level0 scalar macros implementing copynzs, copynzjs.

commit b0a0a0f274a761788531b5d281cc3b411b7124ed
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 9 17:15:38 2013 -0500

    Added handling of restrict, stdint.h for non-C99.
    
    Details:
    - Removed the #include <stdint.h> from blis.h and inserted a cpp macro block
      in bli_type_defs.h that #includes <stdint.h> for C++ and C99, and otherwise
      manually typedefs the types we need (which, for now, are unconditionally
      int64_t and uint64_t).
    - Moved basic typedefs to top of bli_type_defs.h, and comment changes.
    - Added cpp macro block to bli_macro_defs.h that #defines restrict as
      nothing for C++ and non-C99.
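
    The two cpp blocks described above could be sketched roughly as
    follows. This is an illustration under stated assumptions, not the
    contents of bli_type_defs.h or bli_macro_defs.h: the fallback typedefs
    assume long long is 64 bits wide, and ex_copy() is a hypothetical
    helper included only to show the macros in use.

```c
/* Use <stdint.h> where the language guarantees it exists (C++ or C99 and
   later); otherwise declare the needed types by hand. */
#if defined(__cplusplus) || \
    ( defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L )
  #include <stdint.h>
#else
  /* Assumption: long long is 64 bits wide on the target. */
  typedef long long          int64_t;
  typedef unsigned long long uint64_t;
#endif

/* 'restrict' is a keyword only in C99 and later, so define it as nothing
   for C++ and for pre-C99 C. */
#if defined(__cplusplus) || \
    !defined(__STDC_VERSION__) || __STDC_VERSION__ < 199901L
  #define restrict
#endif

/* Hypothetical helper: compiles in either mode because 'restrict' either
   is a real keyword or expands to nothing. */
void ex_copy( int64_t n, const double* restrict x, double* restrict y )
{
    int64_t i;
    for ( i = 0; i < n; ++i ) y[ i ] = x[ i ];
}
```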

commit 4b7e7970f1af4a1ab121e07657e2b78b9fcd7671
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 8 15:20:34 2013 -0500

    Migrated integer usage to stdint.h types.
    
    Details:
    - Changed the way bli_type_defs.h defines integer types so that dim_t,
      inc_t, doff_t, etc. are all defined in terms of gint_t (general signed
      integer) or guint_t (general unsigned integer).
    - Renamed Fortran types fchar and fint to f77_char and f77_int.
    - Define f77_int as int64_t if a new configuration variable,
      BLIS_ENABLE_BLIS2BLAS_INT64, is defined, and int32_t otherwise.
      These types are defined in stdint.h, which is now included in blis.h.
    - Renamed "complex" type in f2c files to "singlecomplex" and typedef'ed
      in terms of scomplex.
    - Renamed "char" type in f2c files to "character" and typedef'ed in terms
      of char.
    - Updated bla_amax() wrappers so that the return type is defined directly
      as f77_int, rather than letting the prototype-generating macro decide
      the type. This was the only use of GENTFUNC2I/GENTPROT2I-related macros,
      so I removed them. Also, changed the body of the wrapper so that a
      gint_t is passed into abmaxv, which is THEN typecast to an f77_int
      before returning the value.
    - Updated f2c code that accessed .r and .i fields of complex and
      doublecomplex types so that they use .real and .imag instead (now that
      we are using scomplex and dcomplex).

commit 372501398564fdba3d5a3db86c30bc1039b185ff
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 8 11:24:18 2013 -0500

    Added experimental bli_gemm_ker_var5().
    
    Details:
    - Added support for an experimental gemm macro-kernel that incrementally
      packs one micro-panel of B at a time. This is useful for certain
      special cases of gemm where m is small.
    - Minor changes to default values of clarksville configuration.
    - Defined BLIS_PACKED_BLOCKS as part of pack_t type, even though we
      do not yet have any use (or implementation support) for block storage.
    - Comment update to bli_packm_init.c.

commit 9915d667a79f23e3a2a2516247c560e9063a1646
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Jul 7 13:28:39 2013 -0500

    Defined "total" blocksize query functions.
    
    Details:
    - Defined bli_blksz_total_for_type() and bli_blksz_total_for_obj() to query
      the default blocksize plus blocksize extension (using the type or the type
      of an object).
    - Comment update in bli_packm_cxk.c.

commit 46d3d09d49aded1d9f1b468c83fce75e07d631dc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 27 13:19:56 2013 -0500

    Consolidated lower/upper her[2]k blocked variants.
    
    Details:
    - Consolidated lower and upper blocked variants for herk and her2k, and
      renamed the resulting variants, according to the same changes recently
      made to trmm and trsm.
    - Implemented support for four new subpartition types:
        BLIS_SUBPART1T
        BLIS_SUBPART1B
        BLIS_SUBPART1L
        BLIS_SUBPART1R
      which correspond to "merged" partitions that include the middle "1"
      partition as well as either the neighboring "0" or "2" partition. This is
      used to clean up code in herk/her2k var2 that attempts to partition away
      the strictly zero region above or below the diagonal of a matrix operand
      that is being marched through diagonally.
    - Added safeguards to herk macro-kernels that skip any leading or trailing
      zero region in the panel of C that is passed in. This is now needed given
      that herk/her2k var1 no longer partitions off this zero region before
      calling the macro-kernel (via bli_her[2]k_int()).
    - Updated comments and other whitespace changes to trmm/trsm macro-kernels.
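
    One way to picture the "merged" subpartitions is as index ranges along
    the partitioned dimension. The sketch below is illustrative only: the
    EX_ names, the range struct, and the reading that 1T merges partitions
    0 and 1 while 1B merges partitions 1 and 2 are assumptions, not the
    BLIS implementation.

```c
/* Hypothetical subpartition identifiers (top-to-bottom partitioning). */
typedef enum
{
    EX_SUBPART0,   /* rows above the current block          */
    EX_SUBPART1,   /* the current block                     */
    EX_SUBPART2,   /* rows below the current block          */
    EX_SUBPART1T,  /* assumed: partition 0 merged with 1    */
    EX_SUBPART1B   /* assumed: partition 1 merged with 2    */
} ex_subpart_t;

typedef struct { long off; long len; } ex_range_t;

/* Return the row range selected when a dimension of length m is
   partitioned at offset i with blocksize b (requires i + b <= m). */
ex_range_t ex_acquire_t2b( ex_subpart_t req, long i, long b, long m )
{
    ex_range_t r = { 0, 0 };

    switch ( req )
    {
        case EX_SUBPART0:  r.off = 0;     r.len = i;         break;
        case EX_SUBPART1:  r.off = i;     r.len = b;         break;
        case EX_SUBPART2:  r.off = i + b; r.len = m - i - b; break;
        case EX_SUBPART1T: r.off = 0;     r.len = i + b;     break;
        case EX_SUBPART1B: r.off = i;     r.len = m - i;     break;
    }
    return r;
}
```

    Under these assumptions, requesting EX_SUBPART1B with m = 10, i = 4,
    b = 2 selects rows 4 through 9, so the rows above the current block
    are partitioned away in a single step.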

commit 02002ef6f3d2746665982793db36714bd69bccc9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 24 17:08:14 2013 -0500

    Added row-storage optimizations for trmm, trsm.
    
    Details:
    - Implemented algorithmic optimizations for trmm and trsm whereby the right
      side case is now handled explicitly, rather than induced indirectly by
      transposing and swapping strides on operands. This allows us to walk through
      the output matrix with favorable access patterns no matter how it is stored,
      for all parameter combinations.
    - Renamed trmm and trsm blocked variants so that there is no longer a
      lower/upper distinction. Instead, we simply label the variants by which
      dimension is partitioned and whether the variant marches forwards or
      backwards through the corresponding partitioned operands.
    - Added support for row-stored packing of lower and upper triangular matrices
      (as provided by bli_packm_blk_var3.c).
    - Fixed a performance bug in bli_determine_blocksize_b() whereby the cache
      blocksize extensions (if non-zero) were not being used to appropriately size
      the first iteration (ie: the bottom/right edge case).
    - Updated comments in bli_kernel.h to indicate that both MC and NC must be
      whole multiples of MR AND NR. This is needed for the case of trsm_r where,
      in order to reuse existing left-side gemmtrsm fused micro-kernels, the
      packing of A (left-hand operand) and B (right-hand operand) is done with
      NR and MR, respectively (instead of MR and NR).

commit d1e81ddc848ee47bc188735883d14582bdd0cabc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 13 11:14:21 2013 -0500

    Minor generalizing tweaks to trmm blk var1, var2.

commit 0efb7974f104206ba3985276f2180a9b14fe9f9b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jun 12 16:40:04 2013 -0500

    CHANGELOG update.

commit 5b641c3bab31eac6a1795b9f6e3f86c59651ca50 (tag: 0.0.8)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jun 12 16:02:12 2013 -0500

    Use separate CFLAGS for "kernels" directories.
    
    Details:
    - Added a new "special" directory type: any source code within directories
      named "kernels" will be compiled with a separate CFLAGS_KERNELS set of
      compiler flags. This allows the developer to specify a separate set of
      flags (e.g. optimization flags) for compiling kernels while maintaining a
      standard set for regular framework code.
    - Fixed a bug in the top-level Makefile that was causing "noopt" code
      to be compiled with the standard set of compilation flags.
    - Updated make_defs.mk in reference, flame, and clarksville configurations
      according to above changes.

commit 08475e7c7653ba598665071a617d10f0d8f763c2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jun 11 12:18:39 2013 -0500

    Various level-3 optimizations for row storage.
    
    Details:
    - Implemented remaining two cases within bli_packm_blk_var2(), which allow
      packing from a lower or upper-stored symmetric/Hermitian matrix to column
      panels (which are row-stored). Previously one could only pack to row panels
      (which are column-stored).
    - Implemented various optimizations in the level-3 front-ends that allow more
      favorable access through row-stored matrices for gemm, hemm, herk, her2k,
      symm, syrk, and syr2k.
    - Cleaned up code in level-3 front-ends that has to do with setting target and
      execution datatypes.

commit 05a657a6b92e8d34efa5c57ae6a18a4f35ec0841
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jun 7 11:04:10 2013 -0500

    Added beta == 0 optimization to x86_64 ukernel.
    
    Details:
    - Modified x86_64 gemm microkernel so that when beta is zero, C is not read
      from memory (nor scaled by beta).
    - Fixed minor bug in test suite driver when "Test all combinations of storage
      schemes?" switch is disabled, which would result in redundant tests being
      executed for matrix-only (e.g. level-1m, level-3) operations if multiple
      vector storage schemes were specified.
    - Restored debug flags as default in clarksville configuration.

commit f1aa6b81cc421516dd77dd0f18f7c432724e6ef2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 6 13:36:06 2013 -0500

    Whitespace changes to old test drivers.
    
    Details:
    - Replaced tabs with four spaces in places where indentation was already
      in place.

commit 9feb4c23d2e36f3d8b5417a3802c69f94b29f749
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jun 4 14:57:46 2013 -0500

    Fixed unaligned handling in axpyf, dotxaxpyf.
    
    Details:
    - Fixed over-cautious handling of unaligned operands in vector intrinsic
      implementation of axpyf kernel.
    - Fixed over- and under-cautious handling of unaligned operands in vector
      intrinsic implementation of dotxaxpyf kernel.

commit 22b06cfcd2e3205c8325a246c2279e4b1047c066
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 3 16:54:52 2013 -0500

    Updated level-1/-1f [vector intrinsic] kernels.
    
    Details:
    - Updated level-1/-1f kernels so that non-unit and un-aligned cases are
      handled by reference implementation (rather than aborted).
    - Added -fomit-frame-pointer to default make_defs.mk for clarksville
      configuration.
    - Defined bli_offset_from_alignment() macro.
    - Minor edits to old test drivers.

commit 0288c827d3659bb225ac9c10f168b623ed0106a2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jun 1 08:02:23 2013 -0500

    Updated ukernels for x86_64.
    
    Details:
    - Tweaked micro-kernels and configuration for clarksville.
    - Updated/cleaned up old test drivers in test directory.
    - Fixed syntax bug in trsv_unb_var1 and trsv_unf_var1 (introduced
      recently).

commit 85a6d1c9a52c2b27c71a3a3e341c51d7ba263749
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon May 6 11:05:08 2013 -0500

    Replaced axpys usage with subs in trsv.
    
    Details:
    - Replaced instances of axpys with alpha equal to -1 with subs.
    - Use BLIS_MAX_TYPE_SIZE to define BLIS_CONSTANT_SLOT_SIZE instead of
      sizeof(dcomplex).

commit 2d9c667f3c48a12cab64e5ad09d5fcb9f4c19d78
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri May 24 16:28:10 2013 -0500

    Fixed x86_64 kernel bugs and other minor issues.
    
    Details:
    - Fixed bugs in trmv_l and trsv_u due to backwards iteration resulting in
      unaligned subpartitions. We were already going out of our way a bit to
      handle edge cases in the first iteration for blocked variants, and this
      was simply the unblocked-fused extension of that idea.
    - Fixed control tree handling in her/her2/syr/syr2 that was not taking
      into account how the choice of variant needed to be altered for
      upper-stored matrices (given that only lower-stored algorithms are
      explicitly implemented).
    - Added bli_determine_blocksize_dim_f(), bli_determine_blocksize_dim_b()
      macros to provide inlined versions of bli_determine_blocksize_[fb]() for
      use by unblocked-fused variants.
    - Integrated new blocksize_dim macros into gemv/hemv unf variants for
      consistency with that of the bugfix for trmv/trsv (both of which now
      use the same macros).
    - Modified bli_obj_vector_inc() so that 1 is returned if the object is a
      vector of length 1 (ie: 1 x 1). This fixes a bug whereby under certain
      conditions (e.g. dotv_opt_var1), an invalid increment was returned, which
      was invalid only because the code was expecting 1 (for purposes of
      performing contiguous vector loads) but got a value greater than 1 because
      the column stride of the object (e.g. rho) was inflated for alignment
      purposes (albeit unnecessarily since there is only one element in the
      object).
    - Replaced some old invocations of set0 with set0s.
    - Added alpha parameter to gemmtrsm ukernels for x86_64 and use accordingly.
    - Fixed increment bug in cleanup loop of gemm ukernel for x86_64.
    - Added safeguard to test modules so that testing a problem with a zero
      dimension does not result in a failure.
    - Tweaked handling of zero dimensions in level-2 and level-3 operations'
      internal back-ends to correctly handle cases where output operand still
      needs to be scaled (e.g. by beta, in the case of gemm with k = 0).
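
    The bli_obj_vector_inc() change above can be illustrated with a small
    stand-in. The struct and function below are hypothetical and far
    simpler than BLIS's obj_t; they only show why a 1 x 1 object should
    report an increment of 1 even when its column stride was padded for
    alignment.

```c
/* Hypothetical stand-in for a vector-shaped object: dimensions m x n with
   row stride rs and column stride cs. */
typedef struct { long m, n, rs, cs; } ex_obj_t;

/* Return the increment between consecutive elements of a vector. */
long ex_vector_inc( const ex_obj_t* v )
{
    /* A 1 x 1 object has no second element, so any stride is valid;
       report 1 so callers that require unit stride for contiguous
       vector loads are satisfied. */
    if ( v->m == 1 && v->n == 1 ) return 1;

    /* Column vector (n == 1): step by the row stride.
       Row vector    (m == 1): step by the column stride. */
    return ( v->n == 1 ) ? v->rs : v->cs;
}
```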

commit d57ec42b34f8447c88adeffa95cf22f8c115ad51
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri May 3 17:35:32 2013 -0500

    Renamed _trans_status() macro.
    
    Details:
    - Mistakenly forgot to rename the _trans_status() macro and instances in
      previous commit.

commit 9e2b227866af429a4a6fb7dbb8c457bbdda2f136
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri May 3 17:24:58 2013 -0500

    Renamed _set_trans(), _trans_status() macros.
    
    Details:
    - Renamed the following macros:
        bli_obj_set_trans()    -> bli_obj_set_onlytrans()
        bli_obj_trans_status() -> bli_obj_onlytrans_status()
      to remove ambiguity as to which bits are read/updated.

commit 2f8174509ea9f844db11ebd9389de5168e85b132
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 1 15:06:30 2013 -0500

    Unconditionally check memory pool(s) for errors.
    
    Details:
    - Changed bli_mem_acquire_m() in bli_mem.c so that we still check if the
      memory pool is exhausted before checking out and returning a block, even
      if BLIS error checking has been disabled. These errors are useful because
      they likely indicate that BLIS was improperly configured for the code
      being run.

commit 75405a2b83679b6aff38d7e7425199d623a7b0a9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 1 15:00:30 2013 -0500

    CHANGELOG update.

commit 6bfa96f84887dec0b4cf8be5d38dd634c2f8951d (tag: 0.0.7)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 30 19:35:54 2013 -0500

    Absorbed blocksize extensions into main objects.
    
    Details:
    - Revamped some parts of commit b6ef84fad1c9 by adding blocksize extension
      fields to the blksz_t object rather than have them as separate structs.
    - Updated all packm interfaces/invocations according to above change.
    - Generalized bli_determine_blocksize_?() so that edge case optimization
      happens if and only if cache blocksizes are created with non-zero
      extensions.
    - Updated comments in bli_kernel.h files to indicate that the edge case
      blocksize extension mechanism is now available for use.

commit bc7c8005cedbe50961ac2a99aeeabf4e9f9a8e9e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 25 17:16:59 2013 -0500

    Added option to disable err checking in testsuite.
    
    Details:
    - Added a new line to input.general that allows one to specify the error-
      checking level to use for each BLIS experiment. The only two levels
      supported for now are "no error checking" and "full error checking".

commit 096b366ddcfe386f44419ef84d8df8be13825f86
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 25 16:43:43 2013 -0500

    Use cntl trees that block in n dimension.
    
    Details:
    - Updated _cntl.c files for each level-3 operation to induce blocked
      algorithms that first partition in the n dimension with a blocksize
      of NC. Typically this is not an issue since only very large problems
      exceed NC. But developers often run very large problems, and
      so this extra blocking should be the default.
    - Removed some recently introduced but now unused macros from
      bli_param_macro_defs.h.

commit b6e24b23cb4dfc488c1c9c70d596539c2287f72e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 25 12:06:12 2013 -0500

    Use PASTEMAC in macro-kernels (over MAC2 or MAC3).
    
    Details:
    - Replaced multi-type invocations of copys_mxn, xpbys_mxn, etc. (PASTEMAC2
      and PASTEMAC3) with those that only use a single type (PASTEMAC).
    - Added extra macros to bli_adds_mxn_uplo.h and bli_xpbys_mxn_uplo.h to
      accommodate above change.
    - Fixed comment typo in bli_config.h files.
    - Added .nfs* pattern to .gitignore.

commit df80acf517dde180ddcc5835c6136b2fa7556d4b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 23 19:43:23 2013 -0500

    Fixed computation of b_next in L3 macro-kernels.
    
    Details:
    - Restructured herk_l and herk_u macro-kernels in the image of trmm
      and trsm, in that the edge cases are captured by the main loop, rather
      than trying to have "cleanup" sections that result in four distinct
      parts (interior, bottom edge, right edge, bottom-right edge) of the
      code.
    - Fixed the way b_next was being computed in the non-gemm level-3
      macro-kernels (herk, trmm, trsm). The way they are computed now matches
      that of gemm.

commit 3671528cf8efe4b445d196665143a5c50c2c6048
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 23 19:12:14 2013 -0500

    Fixed minor bug in computing b_next in gemm.

commit db072a5b4a039a9a668ef951333ecfb5bd3a74b9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 23 17:49:10 2013 -0500

    Fixed rare edge case bug in herk_l macro-kernel.
    
    Details:
    - Fixed a potential bug in herk_l at the m_left edge case. If MR was
      chosen to be much larger than NR, then one could encounter edge cases
      in the MC dimension that fall entirely below the diagonal, which
      the previous implementation of the herk_l macro-kernel was not allowing
      for.

commit 1dab11e37d1cb403cbe75b73a644c00de534f104
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 23 17:17:11 2013 -0500

    Updated x86 gemmtrsm ukernels to use alpha.

commit 9d10d7dd9bc92a993fea7162bfa5983f75506f49
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 23 16:00:18 2013 -0500

    Added a_next, b_next arguments to micro-kernels.
    
    Details:
    - Added two more arguments to the gemm and gemmtrsm microkernels: the
      addresses of the next micro-panels of A and B. By passing these
      pointers into the micro-kernel, we allow the micro-kernel author to
      prefetch micro-panels of A and B as necessary (though this is
      completely optional; these addresses may also be safely ignored).
    - Updated all seven macro-kernels so that they compute and pass in
      a_next and b_next. Note that ONLY the gemm macro-kernel computes
      a_next and b_next with the precise semantics we want. I will go back
      and fix the other macro-kernels in the near future.
    - Added 'restrict' to various micro-kernels from which it was missing.
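
    A rough sketch of how a micro-kernel might use the two extra
    arguments. Everything here is hypothetical: the function name, the
    fixed 2 x 2 register blocksizes, the by-value alpha and beta, and the
    packed layouts are chosen for illustration and do not reproduce the
    BLIS micro-kernel interface. __builtin_prefetch is a GCC/Clang
    builtin; other compilers take the branch that ignores the pointers.

```c
#define UKR_MR 2
#define UKR_NR 2

/* C := beta * C + alpha * A * B for one UKR_MR x UKR_NR tile.
   a: packed micro-panel of A, UKR_MR x k, stored column by column.
   b: packed micro-panel of B, k x UKR_NR, stored row by row.
   a_next, b_next: addresses of the micro-panels used by the next call. */
void ex_dgemm_ukr( long k, double alpha, const double* a, const double* b,
                   double beta, double* c, long rs_c, long cs_c,
                   const double* a_next, const double* b_next )
{
    double ab[ UKR_MR * UKR_NR ] = { 0.0 };
    long   p, i, j;

#if defined(__GNUC__)
    /* Optional: start pulling in the next micro-panels early. */
    __builtin_prefetch( a_next );
    __builtin_prefetch( b_next );
#else
    /* The addresses may also be safely ignored. */
    (void)a_next; (void)b_next;
#endif

    /* Accumulate the k rank-1 updates into a local tile. */
    for ( p = 0; p < k; ++p )
        for ( j = 0; j < UKR_NR; ++j )
            for ( i = 0; i < UKR_MR; ++i )
                ab[ i + j*UKR_MR ] += a[ i + p*UKR_MR ] * b[ j + p*UKR_NR ];

    /* Scale C by beta and add the scaled product. */
    for ( j = 0; j < UKR_NR; ++j )
        for ( i = 0; i < UKR_MR; ++i )
            c[ i*rs_c + j*cs_c ] = beta * c[ i*rs_c + j*cs_c ]
                                 + alpha * ab[ i + j*UKR_MR ];
}
```

    Because prefetching is only a hint, a caller on the last iteration can
    pass the current panels again as a_next and b_next without affecting
    the result.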

commit f3815dc84d385c514a5acaf1e925424a57be2f51
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 23 11:12:33 2013 -0500

    Added code for backward edge-case blocking.
    
    Details:
    - Edited bli_determine_blocksize_b() to include experimental (and
      currently disabled) code that computes extended blocks.
    - Updated comments related to above changes.
    - Enabled use of x86 gemmtrsm ukernel in config/flame/bli_kernel.h.

commit 4fe1435f20e8fc7dd72f795ac58c8e236e6c631b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 22 19:00:43 2013 -0500

    Updated dupl implementation to use PACKNR and NR.
    
    Details:
    - Updated frame/util/dupl/bli_dupl_unb_var1.c to utilize PACKNR and NR
      explicitly to navigate b1 so that situations where PACKNR > NR are
      supported.
    - Moved the 4x2 and 4x4 reference micro-kernels in frame/3/gemm/ukernels and
      frame/3/trsm/ukernels to kernels/c99/.
    - Updated clarksville and flame configurations.

commit 2d6f9e83799a46d52d7901e275f8fd67f0a0edc6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Apr 21 15:10:34 2013 -0500

    Disabled blocksize checks for memory pools.
    
    Details:
    - Temporarily disabled checks that ensure that enough memory will be allocated
      by the contiguous memory allocator for all types, given that the values for
      double precision real are the ones used to allocate the space. These checks
      can easily go awry in certain situations, especially if you are developing for
      only one datatype. So for now, they are probably more trouble than they are
      worth.

commit b6ef84fad1c9884c84b7f1350a0bcdfe1737e8f2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Apr 21 15:00:24 2013 -0500

    Allow ldim of packed micro-panels != MR, NR.
    
    Details:
    - Made substantial changes throughout the framework to decouple the leading
      dimension (row or column stride) used within each packed micro-panel from
      the corresponding register blocksize. It appears advantageous on some
      systems to use, for example, packed micro-panels of A where the column
      stride is greater than MR (whereas previously it was always equal to MR).
    - Changes include:
      - Added BLIS_EXTEND_[MNK]R_? macros, which specify how much extra padding
        to use when packing micro-panels of A and B.
      - Adjusted all packing routines and macro-kernels to use PACKMR and PACKNR
        where appropriate, instead of MR and NR.
      - Added pd field (panel dimension) to obj_t.
      - New interface to bli_packm_cntl_obj_create().
      - Renamed bli_obj_packed_length()/_width() macros to
        bli_obj_padded_length()/_width().
      - Removed local #defines for cache/register blocksizes in level-3 *_cntl.c.
      - Print out new cache and register blocksize extensions in test suite.
    - Also added new BLIS_EXTEND_[MNK]C_? macros for future use in using a larger
      blocksize for edge cases, which can improve performance at the margins.
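
    To make the MR versus PACKMR distinction concrete, the sketch below packs
    one micro-panel of A so that consecutive columns are PACKMR elements
    apart even though only MR rows carry data. The function name, the MR and
    PACKMR values, and the decision to zero-fill the padding rows are
    assumptions of this example rather than details taken from the BLIS
    packing routines.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical values: register blocksize and padded leading dimension. */
enum { MR = 6, PACKMR = 8 };

/* Packs an m x k block (m <= MR) of column-major A, with leading
   dimension lda, into the micro-panel p. Column j of the micro-panel
   begins at p + j*PACKMR. Rows m..PACKMR-1 of every column are
   zero-filled in this sketch. */
static void pack_a_micropanel(size_t m, size_t k,
                              const double *a, size_t lda,
                              double *p)
{
    for (size_t j = 0; j < k; ++j) {
        /* Copy the rows that hold real data. */
        for (size_t i = 0; i < m; ++i)
            p[i + j * PACKMR] = a[i + j * lda];
        /* Fill the remainder of the column, including the extra padding. */
        for (size_t i = m; i < PACKMR; ++i)
            p[i + j * PACKMR] = 0.0;
    }
}
```

    A micro-kernel consuming such a panel would step through it with a
    column stride of PACKMR while still computing on only MR rows.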

commit 59fca58dbe678d79c1df0916b022afbeac7c48fa
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 19 15:26:29 2013 -0500

    Fixed bug in compatibility layer (her2k/syr2k).
    
    Details:
    - Fixed a bug in the BLAS compatibility layer, specifically in bla_her2k.c
      and bla_syr2k.c, that caused incorrect computation to occur when the BLAS
      interface caller requests the [conjugate-]transpose case. Thanks to Bryan
      Marker for reporting the behavior that led to this bug.

commit 09eacbd1ab1380a95a0e9625726b45e43ed102d6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 18 19:39:13 2013 -0500

    Changed old level-3 test drivers to call front-ends.
    
    Details:
    - Changed old level-3 test drivers, in 'test' directory, to always call the
      front-end object API instead of the internal back-end with the locally
      defined control tree.

commit 83e45de23e565138b8fde06fb11cfedc973b7246
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 18 18:33:03 2013 -0500

    Allow packm_init() to reacquire a too-small mem_t.
    
    Details:
    - Changed bli_packm_init() to react differently to a situation where a pack
      obj_t has an already-allocated mem_t entry that has a buffer that is smaller
      than what will be needed to hold the block/panel that now needs to be
      packed. Previously, this situation was treated with an abort() since I
      assumed something was horribly wrong. I have changed the code so that it now
      reacts by releasing the previous mem_t and re-acquiring a new mem_t with the
      new information. (This change was done at the request of Bryan Marker to
      facilitate code generation via DxT.)
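
    The release-and-reacquire behavior described above can be sketched as
    follows. The mem_sketch_t type and ensure_capacity() function are
    hypothetical stand-ins that use malloc()/free(); the actual mem_t and
    the BLIS memory allocator interfaces are not reproduced here.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a cached memory entry: a buffer and its size. */
typedef struct {
    void  *buf;
    size_t size;
} mem_sketch_t;

/* Ensures that mem holds a buffer of at least `needed` bytes.
   - If the existing buffer is large enough, it is reused.
   - If it is too small, it is released and a new one is acquired,
     rather than treating the situation as a fatal error.
   Returns 1 on success and 0 if the allocation failed. */
static int ensure_capacity(mem_sketch_t *mem, size_t needed)
{
    if (mem->buf != NULL && mem->size >= needed)
        return 1;

    free(mem->buf);              /* free(NULL) is a harmless no-op */
    mem->buf  = malloc(needed);
    mem->size = (mem->buf != NULL) ? needed : 0;

    return mem->buf != NULL;
}
```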

commit a6990434173b0cf651f8521194f3aef738deb7d2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 18 13:52:47 2013 -0500

    Fixed bug in packing block of A for hemm/symm.
    
    Details:
    - Fixed a bug in bli_packm_blk_var2() that affected the packing functionality
      of hemm and symm. The bug occurs whenever attempting to pack a Hermitian or
      symmetric matrix where the block of A being packed intersects the diagonal,
      but some of its micro-panels do not intersect the diagonal and lie completely
      in the unstored region. Thanks to Francisco Igual for reporting this bug.
    - Comment updates to both _blk_var2.c and _blk_var3.c.

commit c92e7590e1934f830814ab614c794215ebe0c415
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Apr 17 20:53:29 2013 -0500

    Activated bli_packm_acquire_mpart_t2b().
    
    Details:
    - Removed the overly-paranoid bli_abort() from the end of
      bli_packm_acquire_mpart_t2b(), to allow others to experiment with
      partitioning through packed blocks of A. Also, and more importantly,
      changed an earlier check that was causing an erroneous (but
      coincidentally redundant) abort(). Also, updated some of the comments
      in bli_packm_part.c.

commit bea579e9f009a44e08008eb14d09f38748ab2b53
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 16 19:43:14 2013 -0500

    Allow creation of "empty" objects.
    
    Details:
    - Modified bli_obj_alloc_buffer() to allow allocating an empty buffer, and
      modified bli_adjust_strides() to explicitly handle m = n = 0.
    - Updated bli_check_matrix_strides() to allow cases where m = n = 0.

commit 7904e20f2e6908571ee5008da2a08084198eefae
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 16 17:37:16 2013 -0500

    Fixed "root" object bug in bli_her[2]k/syr[2]k.
    
    Details:
    - Fixed an obscure bug in the front-ends for herk, her2k, syrk, and syr2k,
      that manifested as the incorrect triangle being updated. It occurred when
      the user would pass in a matrix object that was correctly marked as
      symmetric/Hermitian and lower-stored, but whose root object was never marked
      as lower (or upper). We now alias and re-assign root status for matrix C
      within the front-ends. Note that trmm and trsm were already doing this,
      albeit for a slightly different reason (to allow the internal back-end to
      choose which algorithm to run--lower or upper--based on the uplo of the root
      object for both left and right side cases). Thanks to Bryan Marker for
      leading me to this bug.

commit 19155a768dd97b57cfb59c32fa8e54a344ec66e1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 16 11:24:03 2013 -0500

    Fixed overzealous type-checking in bli_getsc().
    
    Details:
    - Relaxed type checking in getsc so that the input object could be a constant
      and not just a proper floating-point type. (If it is a constant, default to
      extracting the dcomplex values.) Thanks to Bryan Marker for reporting this
      bug.
    - Added definition for bli_is_constant() in bli_param_macro_defs.h
    - Comment updates to various level-0 scalar routines.

commit 2ee6bbca2953d04c967685da9735b3eaf8a4b813
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 15 19:27:57 2013 -0500

    Fixed bug in bli_obj_is_packed() and renamed.
    
    Details:
    - This macro is used to determine whether the partitioning routines should
      call a corresponding packm_part routine instead. However, it was
      unintentionally catching matrices that were marked as "packed" by virtue
      of them simply being marked as BLIS_PACKED_UNSPEC in, say, bli_gemv().
      The macro has now been renamed to bli_obj_is_panel_packed(), and now only
      checks for row or column panel packing. (Note that I first attempted to
      fix this bug in a571af816d72.) Thanks to Bryan Marker for reporting the
      erroneous behavior that led me to this bug.

commit 99b99eebe70336b5f28039a4a084aa7f5fa7059d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 15 17:54:43 2013 -0500

    Removed local reference ukernel blocksize macros.
    
    Details:
    - Removed locally defined gemm microkernel blocksize macros from _mxn
      reference microkernel definition and header. Meant to include this in
      a recent/previous commit (0020ef7c8271).

commit 6a538fa7b164655f41cea5b9c8d3902438bda66b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 15 14:40:31 2013 -0500

    Formatting change to mods in previous commit.

commit ea079d35591e808971d2d98a1a7d9f89bc1f7c2f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 15 14:31:40 2013 -0500

    Set structure of objects in level-2 BLIS APIs.
    
    Details:
    - Added missing statement to set structure field of local objects in
      top-level BLIS (BLAS-like) API wrappers. Thanks to Bryan Marker for
      reporting this bug.

commit d9948c541c0446e20e249a1ccc83709ce51b7aa8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 15 10:21:26 2013 -0500

    Tweak to test suite function string construction.
    
    Details:
    - Fixed a minor bug in the way that the test suite would construct function
      name strings when the user anchored all parameters in input.operations.
      In this case, the test driver would mistake this situation for one where
      the operation simply had no parameters to begin with, and thus would not
      include the parameter string in the function string that is output for
      every result.

commit ca9e435c57c5c7a000d2a32681dd8070ba850abd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 15 09:59:46 2013 -0500

    Fixed a bug in reference implementation of dupl.
    
    Details:
    - Fixed a bug in reference implementation of dupl (bli_dupl_unb_var1.c),
      which resulted in incorrect duplication.
    - Updated old test drivers according to recently updated packm control tree
      creation interface.
    - Added 'restrict' to x86 gemm microkernel interface.

commit 26cbd52e364bbe439e3744101cd5a6cbcb82dffd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Apr 14 19:05:33 2013 -0500

    Modified bli_kernel.h include order in blis.h.
    
    Details:
    - Delayed #include of bli_kernel.h in blis.h to prevent a situation where
      _kernel.h includes an optimized microkernel header, which uses BLIS types
      such as dim_t and inc_t, which would precede the definition of those types
      in bli_type_defs.h.
    - Moved the #include of bli_kernel_macro_defs.h in bli_macro_defs.h to blis.h
      (immediately after that of bli_kernel.h).

commit 3414a23c38b0de45a8034b3dda2fc4b5a755e4e1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Apr 13 16:53:16 2013 -0500

    CHANGELOG update.

commit ec16c52f2ecf419c749175ce0a297441c10f1c68 (tag: 0.0.6)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Apr 13 16:41:16 2013 -0500

    Updated INSTALL file (now redirects to website).

commit 0020ef7c82711a7ebf08e5174f939bee2563184c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Apr 13 15:26:35 2013 -0500

    Removed gemmtrsm-, trsm-specific blocksize macros.
    
    Details:
    - Modified gemmtrsm micro-kernel wrappers to use new aliased blocksize macros
      instead of operation-specific ones.
    - Removed local, gemmtrsm-specific blocksize macro definitions found in
      micro-kernel header files.
      (Meant to include above changes in 31b100e7bf4a.)
    - Added comments to reference gemmtrsm micro-kernel wrapper implementation.

commit 1a9f427b85bb95aaa9e54c8ff8ecad8734b361ee
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 12 15:25:54 2013 -0500

    Added/renamed alignment constants to _config.h.
    
    Details:
    - Added new memory alignment constants:
        BLIS_HEAP_STRIDE_ALIGN_SIZE   (previously assumed to be same as SYSTEM_MEM)
        BLIS_CONTIG_ADDR_ALIGN_SIZE   (previously assumed to be same as PAGE_SIZE)
        BLIS_STACK_BUF_ALIGN_SIZE     (previously not enforced)
      and renamed existing ones
        BLIS_SYSTEM_MEM_ALIGN_SIZE -> BLIS_HEAP_ADDR_ALIGN_SIZE
        BLIS_CONTIG_MEM_ALIGN_SIZE -> BLIS_CONTIG_STRIDE_ALIGN_SIZE
      to better convey what the alignment factor is used for (and what it is
      not used for).
    - Removed BLIS_ENABLE_SYSTEM_MEM_ALIGN. Dynamic memory alignment is now
      disabled by setting BLIS_HEAP_STRIDE_ALIGN_SIZE to 1.
    - Inserted instances of __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE)))
      into macro-kernels to specify stack alignment of temporary buffers.
    - Modified test suite driver to output new constants.
    - Removed bli_align_dim_to_sys() and bli_align_dim_to_cmem(). Instead, we now
      use bli_align_dim_to_size(), which takes a third argument (the desired
      alignment).
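
    One plausible reading of a three-argument "align a dimension to a size"
    helper is sketched below: the dimension is converted to bytes, rounded
    up to a multiple of the requested alignment, and converted back to
    elements. The function name and the exact arithmetic are assumptions of
    this example, and it presumes that align_size is either 1 or a multiple
    of elem_size.

```c
#include <assert.h>
#include <stddef.h>

/* Rounds `dim` (a count of elements, each elem_size bytes) up so that
   dim * elem_size becomes a multiple of align_size bytes.
   Assumes align_size is 1 or a multiple of elem_size. */
static size_t align_dim_to_size_sketch(size_t dim,
                                       size_t elem_size,
                                       size_t align_size)
{
    size_t bytes  = dim * elem_size;
    size_t padded = ((bytes + align_size - 1) / align_size) * align_size;

    return padded / elem_size;
}
```

    Under these assumptions, passing an alignment of 1 leaves the dimension
    unchanged, which matches the idea of disabling alignment by setting the
    corresponding constant to 1.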

commit a77d10e87e3c0ab55ec14d74c285bc95c06285c3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 12 11:40:55 2013 -0500

    Fixed a bug in axpyv/axpym when alpha is unit.
    
    Details:
    - Fixed bug whereby axpyv and axpym were incorrectly simplifying to a copy,
      rather than an add, when alpha = 1. Thanks to Bryan Marker for identifying
      this bug.
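
    The distinction behind this fix can be shown with a small sketch: when
    alpha equals one, y := y + alpha*x simplifies to an add (y := y + x),
    not to a copy (y := x). The function below is a hypothetical
    double-precision, unit-stride version written for illustration; it is
    not the BLIS axpyv implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y := y + alpha * x for unit-stride vectors of length n. */
static void daxpyv_sketch(size_t n, double alpha,
                          const double *x, double *y)
{
    if (alpha == 0.0)
        return;                      /* nothing to accumulate */

    if (alpha == 1.0) {
        /* Special case: an add. A copy here would discard y. */
        for (size_t i = 0; i < n; ++i)
            y[i] += x[i];
        return;
    }

    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}
```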

commit 0495bd1d6de5995fe2fb79b321eec79e961eb7a5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 11 16:39:25 2013 -0500

    Moved _POSIX_C_SOURCE def to compiler cmd line.
    
    Details:
    - Removed the #define of _POSIX_C_SOURCE in bli_config.h (for both reference
      and clarksville configurations) and added "-D_POSIX_C_SOURCE=200112L" to
      the compiler command line arguments in make_defs.mk (for both configs).
      Thanks to Devin Matthews for suggesting this change.

commit d43d1a0a2ef6de4bc57627566aef8e3fdb458b8c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 11 16:28:17 2013 -0500

    Prepended 'f2c_' to abs, min, max macros in f2c.h.
    
    Details:
    - Renamed abs, min, max, dmin, and dmax macros in bli_f2c.h so that they
      would not conflict with anything defined by the user (or the language).
      Thanks to Devin Matthews for suggesting this fix.
    - Updated all instances of the above macros accordingly.
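
    The motivation for the rename can be seen in a short sketch: an
    unprefixed abs macro would collide with abs() from <stdlib.h>, whereas
    prefixed names coexist with it. The macro bodies below are written for
    this example and are assumed, not copied from bli_f2c.h.

```c
#include <assert.h>
#include <stdlib.h>   /* declares the standard abs() */

/* Prefixed helper macros; the bodies are illustrative assumptions. */
#define f2c_abs(x)    ((x) >= 0 ? (x) : -(x))
#define f2c_min(a, b) ((a) <= (b) ? (a) : (b))
#define f2c_max(a, b) ((a) >= (b) ? (a) : (b))

/* Uses both the standard function and the prefixed macro in one
   translation unit, which an unprefixed abs macro would not allow. */
static int both_abs_agree(int v)
{
    return abs(v) == f2c_abs(v);
}
```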

commit 31b100e7bf4aeaa4ceafefd2b6c3102d5fbc4cbb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 11 11:11:52 2013 -0500

    Added new kernel blocksize macro aliases.
    
    Details:
    - Added new macros that alias level-3 cache and register blocksize macros
      to names that can be constructed via the PASTEMAC macro. These aliased
      macro definitions live inside bli_kernel_macro_defs.h, which is now
      #included after bli_kernel.h.
    - Modified macro-kernels to use new aliased blocksize macros instead of
      operation-specific ones.
    - Removed local, operation-specific kernel blocksize macro definitions
      (found in macro-kernel header files).

commit bd2b24ba65b36d7c07c5918a3838ce2ff57c4b48
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 11 10:35:39 2013 -0500

    Updated CREDITS file.

commit 79328c15410215737f3f14cd069328cf52aa11fd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 11 10:32:14 2013 -0500

    Reverted testsuite object files' home to 'obj'.
    
    Details:
    - Removed 'obj' and 'lib' from .gitignore.
    - Added testsuite/obj/.gitkeep (which is an empty file).
    - Updated testsuite/Makefile accordingly.
    - Thanks to Vernon Austel for pointing out the .gitkeep trick to tracking
      empty directories in git.

commit 4afe3bfd82c03e1e97b58b7d250588a0d28541e5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 9 17:45:39 2013 -0500

    Renamed/moved object scalar constant macros.
    
    Details:
    - Replaced scalar constant macro definitions in bli_const_defs.h with a single,
      simpler macro in bli_obj_macro_defs.h.
    - Updated invocations of old macros accordingly.
    - Removed bli_const_defs.h.

commit 357893f5be5c56ab7b062874005e77e614b23f06
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 9 14:48:15 2013 -0500

    Applied fix from prev commit to gemmtrsm_?_ref_4x4
    
    Details:
    - Fixed hard-coded kernels in bli_gemmtrsm_l_ref_4x4.c and
      bli_gemmtrsm_u_ref_4x4.c.

commit 54988e8dca44475610bcaee5a7bc1c40e8921402
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 8 19:08:43 2013 -0500

    Fixed a performance bug in trsm.
    
    Details:
    - Fixed a bug in the reference implementations of the gemmtrsm wrappers
      (bli_gemmtrsm_l_ref_mxn.c and bli_gemmtrsm_u_ref_mxn.c) whereby the
      reference gemm microkernel was hard-coded, and thus always called, even
      when GEMM_UKERNEL was defined to point to an optimized microkernel. This
      manifested as artificially low trsm performance for all problem sizes, but
      especially for small problem sizes as it only affected blocks of A that
      intersected the diagonal. Thanks to Mike Kistler of IBM for helping me
      find this bug.

commit a7252e40b5c351eef9a1df531ea0ef25cb5fb705
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 8 16:08:22 2013 -0500

    Generate testsuite objects in 'src'.
    
    Details:
    - Tweaked the testsuite makefile so that object files are stored in 'src'
      rather than 'obj', since (a) the top-level .gitignore dictates that
      obj directories are to be ignored, and (b) since git has problems
      tracking empty directories. Now, users do not need to create their own
      obj directories within their own local clones of BLIS.

commit 803871c55b60d3c225ad9a0607fa507a9c16aab7
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 8 15:18:42 2013 -0500

    Minor formatting changes.

commit a571af816d72727e16cad37007e7043b9d6fa362
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 8 15:00:13 2013 -0500

    Fixed definition of bli_is_packed_object() macro.
    
    Details:
    - Changed the definition of bli_is_packed_object() so that it keys off of the
      value of the pack schema bits in the info field of obj_t, rather than
      comparing the obj_t buffer with that of the mem_t entry. This was the cause
      of a very low probability bug whereby uninitialized memory caused the macro
      to evaluate to TRUE even though the object in question was not packed.
      Thanks to Vernon Austel of IBM for helping discover this bug.
    - Changed an abort() in bli_packm_part() to a not-yet-implemented.

commit 3be14c32f735ecc6169d3ab6370cf8b69162acec
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Apr 6 12:54:45 2013 -0500

    Updated information in testsuite output header.
    
    Details:
    - Added to the information that is echoed at the beginning of the test suite's
      output, and also re-labeled some existing information.

commit 874707c1b183a4dd9a91dbfd4ea1522384c190df
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 5 17:19:43 2013 -0500

    Fixed edge case handling bug in herk macrokernels.
    
    Details:
    - Fixed a bug present in bli_herk_l_ker_var2() and bli_herk_u_ker_var2() that
      only manifests when BLIS is configured such that MR != NR. The bug involves
      incorrectly detecting edge cases, which resulted in some parts of matrix C
      potentially being skipped and not updated, depending on the problem size.
    - Updated the default values of MR and NR in config/reference/bli_kernel.h to
      8 and 4, respectively, so that I can better stress the framework on a
      day-to-day basis. (The fact that they were both equal to 4 for so long is
      why I did not stumble upon this bug much sooner.)

commit 7cbda15291d3e01300e71c286b9657b7ef0708bf
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 4 15:25:43 2013 -0500

    Added reference microkernels for arbitrary MR, NR.
    
    Details:
    - Added a new set of reference gemm, gemmtrsm, and trsm micro-kernels that
      contain explicit loops over MR and NR, thus allowing them to be used
      unmodified by developers who want to build a reference library with
      custom register blocksizes.
    - Changed config/reference/bli_kernel.h to use above ukernels by default.
    - Changed interfaces of new and existing gemm, gemmtrsm, and trsm micro-kernels
      to use 'restrict' keyword.
    - Added -funroll-loops option to config/reference/make_defs.mk.
    - Updated comments in bli_kernel.h describing constraints on register and
      cache blocksizes.
    - Updated _adds_mxn.h, _copys_mxn.h, and _xpbys_mxn.h macros files so that
      single-char macros are also defined.

commit 6684b73d5501f91d24a79e26655a42819c9b3114
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 2 13:06:20 2013 -0500

    Implemented amax operation and related changes.
    
    Details:
    - Implemented amax operation in BLIS.
    - Activated BLAS2BLIS routine mapping for new amax BLIS implementation.
    - Added integer support to [f]printv, [f]printm.
    - Added integer support to level-0 copys macros.
    - Updated printing of configuration information in test suite driver.
    - Comment changes to _config.h files.
    - Added comments to bla_dot.c to remind the reader what sdsdot()/dsdot() are
      used for.

commit fb68087f8727cd5fd656a742a110e54fb1c91db9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 26 15:10:16 2013 -0500

    More memory alignment-related tweaks.
    
    Details:
    - Renamed BLIS_MEMORY_ALIGNMENT_SIZE to BLIS_CONTIG_MEM_ALIGN_SIZE.
    - Renamed BLIS_ENABLE_MEMORY_ALIGNMENT to BLIS_ENABLE_SYSTEM_MEM_ALIGN.
    - Added BLIS_SYSTEM_MEM_ALIGN_SIZE, which controls only the alignment
      passed into posix_memalign() or equivalent.
    - Defined new function, bli_align_dim_to_cmem(), which applies the
      contiguous memory alignment (rather than the system/malloc alignment).

commit 9682ef61dbf9a8846c8b0826d4de24bc216cd641
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 26 14:14:53 2013 -0500

    Always define memory alignment size cpp constant.
    
    Details:
    - Removed guard around #define for memory alignment size constant.
      Memory alignment should always be enabled, and so this value should
      always be defined.

commit 3a787cccaae16531474f34398e3c0cf4f49b8cd8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 26 13:59:19 2013 -0500

    Renamed memory alignment macro constant.
    
    Details:
    - Renamed all occurrences of BLIS_MEMORY_ALIGNMENT_BOUNDARY to
      BLIS_MEMORY_ALIGNMENT_SIZE.

commit 37308f9a502b56d94fa52a7df71c676a46c3be3d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 26 12:43:14 2013 -0500

    Align packed panel strides with system alignment.
    
    Details:
    - Pass panel strides through bli_align_dim_to_sys() to ensure that each
      subsequent packed panel of A and B begins at an aligned address. (The
      first panel is presumably aligned to system alignment because it is
      aligned to a page boundary, which is typically much larger.)
    - Rearranged code in packm_init_pack() to prevent additional conditional
      blocks as a result of the aforementioned change.
    - Adjusted contiguous memory allocator so that the system memory alignment
      is used to allocate enough space for each block no matter what kind of
      register blocking is used (even if register blocksize is unit and every
      row/column needs maximal padding).
    - Adjusted default blocksizes in reference configuration so that MC*KC
      and KC*NC result in identical footprints for all datatypes.

commit 40a0654ada5f256beb3da80ebba015a3c71fb61f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Mar 24 20:18:12 2013 -0500

    CHANGELOG update.

commit b65cdc57d9e51fa00e3c03539cfb7e045707d0f4 (tag: 0.0.5)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Mar 24 20:01:49 2013 -0500

    Migrated 'bl2' prefix to 'bli'.
    
    Details:
    - Changed all filename and function prefixes from 'bl2' to 'bli'.
    - Changed the "blis2.h" header filename to "blis.h" and changed all
      corresponding #include statements accordingly.
    - Fixed incorrect association for Fran in CREDITS file.

commit 132bffcef7441f32d02cc7485aef6a0648e0ef1e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Mar 24 18:49:36 2013 -0500

    Removed several 'old' directories and files.
    
    Details:
    - Removed most of the 'old' directories scattered throughout the framework,
      which includes alternate/half-baked/broken implementations.

commit 551ea4767a3ea6c263f12aaca94bc2642cee4cfa
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Mar 24 18:00:10 2013 -0500

    Removed #include "blis2.h" from low-level headers.
    
    Details:
    - Removed #include of "blis2.h" from various lower-level, operation-specific
      header files throughout the framework. Given that these low-level headers
      are included within blis2.h in a very specific order, #include'ing blis2.h
      within them directly is unnecessary.

commit bc7b318ed0960edeb4537797dd8c91de0d942ca9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 22 17:18:58 2013 -0500

    Added cpp guards to conflicting libflame typedefs.
    
    Details:
    - Added cpp guards around the definitions of dim_t, scomplex, and dcomplex.
      This is a temporary hack to allow interoperability with libflame. (Similarly
      temporary changes are being made to libflame's type definitions file.)

commit f469907503fcdc24dff0174c569170e6e756e045
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 22 15:20:15 2013 -0500

    Renamed MAX_PREFETCH_BYTE_OFFSET to MAX_PRELOAD_.
    
    Details:
    - Renamed BLIS_MAX_PREFETCH_BYTE_OFFSET to
      BLIS_MAX_PRELOAD_BYTE_OFFSET since "prefetch" is kind of a loaded word
      (e.g. "prefetch" instructions, which are different than the particular
      kind of prefetching/preloading referred to by this constant).

commit d1023bfbc6668a58a01ee4f82ded2319911e7b19
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 22 15:09:59 2013 -0500

    Removed build/old directory.

commit 718888849c48d99f83eea6b8f83bc1998cffef7e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 22 15:07:01 2013 -0500

    Deprecated 'flame' configuration.
    
    Details:
    - Removed 'flame' configuration, as it was horribly out-of-date.
    - Comment changes to bl2_blocksize.c and bl2_mem.c.

commit bba38cf4e9d28058c14483f44fa074a6d2852ad9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 19 18:07:40 2013 -0500

    Added missing conjbeta argument to scald.

commit 1f82b51d06d0279dded3f2b87ba59403f3ed0af6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Mar 18 15:37:20 2013 -0500

    Relocated packed mem_t dimension fields to obj_t.
    
    Details:
    - Removed the m and n (and elem_size) fields from the mem_t object, and added
      m_packed and n_packed fields to obj_t. These new fields track the same as
      the old ones. From an abstraction standpoint, it seemed awkward to store
      those dimensions inside the mem_t.
    - Updated interfaces to bl2_mem_acquire_*() so that only a byte size argument
      is passed in, instead of m, n, and elem_size.
    - Updated bl2_packm_init_pack() and bl2_packv_init_pack() to inline the
      functionality of bl2_mem_alloc_update_m() and bl2_mem_alloc_update_v(),
      respectively.
    - Updated packm variants to access the packed length and width fields from
      their new locations.

commit 36c782857bf9b8ac1b1dac47a70f689a4407e2cc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Mar 18 10:37:03 2013 -0500

    CHANGELOG update.

commit e7d41229d3b1674e74f47d7f29fae004a745201a (tag: 0.0.4)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 15 17:12:36 2013 -0500

    Re-implemented contiguous memory allocator.
    
    Details:
    - Completely re-wrote the contiguous memory allocator (bl2_mem.c). The new
      allocator instantiates and initializes three separate memory pool objects,
      each one associated with a separate array of contiguous memory blocks, each
      block of fixed and uniform size. (The three pools are for allocating mc-by-kc
      blocks of A, kc-by-nc panels of B, and mc-by-nc panels of C.) The pool
      objects use a stack structure internally to track which blocks in the region
      have been "checked out" to a thread and which are still available. Critical
      regions are now clearly marked and adaptable to parallel environments (e.g.
      OpenMP). Memory pools are set up when bl2_init() is called.
    - Added a new field to the packm control tree node, which indicates what kind
      of packed buffer is being allocated. The enumerated type for this argument
      is defined as packbuf_t in bl2_type_defs.h.
    - Updated level-3 _cntl.c files to pass in the appropriate value for a new
      packbuf_t argument to bl2_packm_cntl_obj_create().
    - Moved some macros called by packm_init_pack() from bl2_obj_macro_defs.h to
      bl2_mem_macro_defs.h.
    - Added BLIS_MAX_NUM_THREADS to bl2_config.h, which we use as the default
      number of blocks of A reserved for the memory allocator.
    - Deprecated bl2_align_dim(). Replaced usage with that of
      bl2_align_dim_to_mult(). Turns out that typically we don't need to align
      a dimension to the system alignment, since that value has to do with
      starting addresses, whereas the values we are dealing with are unitless
      dimensions.
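
    The stack-based pool described above can be sketched roughly as follows:
    a fixed array of equally sized blocks plus a stack of pointers to the
    blocks that are still available. The type and function names, the block
    count, and the block size are hypothetical, and the sketch is
    single-threaded; in a parallel setting the checkout and release bodies
    would be the critical regions that need protection.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pool geometry for this sketch. */
enum { NUM_BLOCKS = 4, BLOCK_BYTES = 1024 };

typedef struct {
    unsigned char storage[NUM_BLOCKS][BLOCK_BYTES]; /* the blocks themselves  */
    void         *free_stack[NUM_BLOCKS];           /* available blocks       */
    int           top;                              /* index of top, -1=empty */
} pool_sketch_t;

/* Marks every block as available. */
static void pool_init(pool_sketch_t *pool)
{
    for (int i = 0; i < NUM_BLOCKS; ++i)
        pool->free_stack[i] = pool->storage[i];
    pool->top = NUM_BLOCKS - 1;
}

/* Checks out one block, or returns NULL if none remain.
   (Critical region in a threaded setting.) */
static void *pool_checkout(pool_sketch_t *pool)
{
    if (pool->top < 0)
        return NULL;
    return pool->free_stack[pool->top--];
}

/* Returns a previously checked-out block to the pool.
   (Critical region in a threaded setting.) */
static void pool_release(pool_sketch_t *pool, void *block)
{
    if (pool->top >= NUM_BLOCKS - 1)
        return;                      /* pool already full; ignore */
    pool->free_stack[++pool->top] = block;
}
```

    Because every block in a given pool has the same size, checkout and
    release are constant-time stack operations with no searching or
    fragmentation.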

commit 1e76cae00cb0a04544aaae1ade878686b238d283
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 15 12:21:42 2013 -0500

    Perform her2k var1 loops in sequence.
    
    Details:
    - Changed variant 1 of her2k so that the two rank-k products are computed
      and accumulated in sequence rather than fused into one loop. This is
      necessary if BLIS is to be configured to provide only enough contiguous
      memory for one panel of B.

commit c95c270eba91ae4efc26603beddfd0292caa919b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Mar 7 14:42:15 2013 -0600

    Enhanced tracking of dimensions for mem_t objects.
    
    Details:
    - Added new fields to mem_t struct definition to track the allocated (as
      opposed to the currently used) dimensions of the memory region. This
      allows packm_init() to be more robust in situations where memory is
      already allocated but is more than needed for the current packing job.
    - Updated logic in bl2_obj_set_buffer_with_cached_packm_mem() macro, used
      in packm_init(), to update the "currently used" dimensions of the mem_t
      object if the requested dimensions are smaller than the allocated
      dimensions.

commit e99281a0f41d482fddeffa239bfc8e13e6d13d4b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Mar 7 14:00:10 2013 -0600

    Fixed test suite flop formulas for ops with side.
    
    Details:
    - Fixed incorrect flop counts in test suite modules for hemm, symm, trmm,
      trmm3, and trsm.
    - Comment updates in herk macro-kernels.

commit ef8cbfc44dd620fdcbdb51cdb173217194bebe31
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Mar 2 12:47:06 2013 -0600

    Added "version" to .gitignore.
    
    Details:
    - Added "version" to .gitignore file so that the file does not show up when
      running 'git status', or accidentally get pulled into the index when
      running 'git add' or 'git add --all'.

commit e9e0747c2f6c178f53ac46ab794acbb7b8c4fea8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Mar 2 12:43:54 2013 -0600

    Removed version file from version control.
    
    Details:
    - Removed version file from version control to prevent git errors that occur
      when trying to pull new commits.

commit bb612f864e9c17dd9805e9446840f02259619469
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 1 12:55:42 2013 -0600

    Updated behavior of bl2_obj_induce_trans() macro.
    
    Details:
    - Changed bl2_obj_induce_trans() so that the transposition bit is no longer
      updated as part of the macro. All current uses of the macro have been
      coupled with instances of bl2_obj_set_trans() to clear the bit.
    - Added Jed to CREDITS file.

commit f24e29b789e7314764a818ceb3063126936c986f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 22 18:15:41 2013 -0600

    Replaced banded/packed BLAS2 stubs with f2c code.
    
    Details:
    - Retired the blas2blis wrappers that simply called abort with a "not yet
      implemented" message. This includes all of the level-2 banded and packed
      routines.
    - Replaced the aforementioned with the corresponding netlib implementations
      having been run through f2c (with some customization).
    - Added directories named 'attic' to build/gen-make-frags/ignore_list.

commit 1454c1a14207766dfed372b8e38b47fa384f5198
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 22 12:38:45 2013 -0600

    Moved Fortran name-mangling macro to bl2_config.h.
    
    Details:
    - Moved the Fortran-77 name-mangling macros from bl2_blas_macro_defs.h to the
      configuration directory (bl2_config.h, specifically) given that it can be
      expected to be tweaked by some developers.
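
    Illustration (not part of the original commit): a minimal sketch of what
    a configurable Fortran-77 name-mangling macro can look like. The names
    SKETCH_F77 and SKETCH_F77_UNDERSCORE are hypothetical, and the trailing
    underscore convention is an assumption (it is common with gfortran); the
    actual macros in bl2_config.h may differ.

```c
/* Hypothetical switch: set to 0 for compilers that append no underscore. */
#define SKETCH_F77_UNDERSCORE 1

#if SKETCH_F77_UNDERSCORE
#define SKETCH_F77(name) name ## _
#else
#define SKETCH_F77(name) name
#endif

/* With the switch on, this defines the external symbol daxpy_sketch_.
   Arguments are passed by reference, as a Fortran caller would pass them.
   Only positive increments are handled in this sketch. */
void SKETCH_F77(daxpy_sketch)(const int *n, const double *alpha,
                              const double *x, const int *incx,
                              double *y, const int *incy)
{
    for (int i = 0; i < *n; ++i)
        y[i * *incy] += *alpha * x[i * *incx];
}
```

    Keeping the macro in a per-configuration header means the mangling rule
    can be changed without touching the wrappers themselves.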

commit ede75693e5a36c6006087c4a7df834175b604504 (tag: 0.0.3)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 22 12:11:24 2013 -0600

    Implemented blas2blis compatibility layer.
    
    Details:
    - Added the blas2blis compatibility layer, located in frame/compat. This
      includes virtually all of the BLAS, including banded and packed level-2
      operations.
    
    - Defined bl2_init_safe(), bl2_finalize_safe(). The former allows a conditional
      initialization, which stores the "exit status" in an err_t, which is then
      read by the latter function to determine whether finalization should actually
      take place.
    - Added calls to bl2_init_safe(), bl2_finalize_safe() to all level-2 and
      level-3 BLAS-like wrappers.
    - Added configuration option to instruct BLIS to remain initialized whenever
      it automatically initializes itself (via bl2_init_safe()), until/unless the
      application code explicitly calls bl2_finalize().
    
    - Added INSERT_GENTFUNC* and INSERT_GENTPROT* macros to facilitate type
      templatization of blas2blis wrappers.
    - Defined level-0 scalar macro bl2_??swaps().
    - Defined level-1v operation bl2_swapv().
    - Defined some "Fortran" types to bl2_type_defs.h for use with BLAS
      wrappers.
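
    Illustration (not part of the original commit): a minimal sketch of the
    conditional initialize/finalize pattern described above. Every name here
    (lib_init_safe(), init_status_t, and so on) is hypothetical; the real
    bl2_init_safe() and bl2_finalize_safe() signatures are not shown in this
    log, so treat this only as one way the idea could be expressed.

```c
#include <stdbool.h>

/* Hypothetical status type standing in for err_t. */
typedef enum { LIB_SUCCESS = 0, LIB_ALREADY_INITIALIZED = 1 } init_status_t;

static bool lib_is_initialized = false;
static int  lib_init_count = 0;   /* counts real initializations (demo only) */

void lib_init(void)     { lib_is_initialized = true; ++lib_init_count; }
void lib_finalize(void) { lib_is_initialized = false; }

/* Initialize only if needed, and record which case occurred. */
void lib_init_safe(init_status_t *status)
{
    if (lib_is_initialized) { *status = LIB_ALREADY_INITIALIZED; return; }
    lib_init();
    *status = LIB_SUCCESS;
}

/* Finalize only if the matching lib_init_safe() performed the init. */
void lib_finalize_safe(init_status_t status)
{
    if (status == LIB_SUCCESS) lib_finalize();
}

/* A BLAS-like wrapper brackets its work with the two safe calls. */
double wrapper_dot(int n, const double *x, const double *y)
{
    init_status_t st;
    double rho = 0.0;

    lib_init_safe(&st);
    for (int i = 0; i < n; ++i) rho += x[i] * y[i];
    lib_finalize_safe(st);

    return rho;
}
```

    If the application has already initialized the library, the wrapper
    leaves that state alone; otherwise it cleans up after itself.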

commit 995edf43e21c1868732dbdd7fee14b08730218bd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 21 14:30:50 2013 -0600

    Updated version file. (Forgot to in prev commit).

commit e823b08aaf7b65ecc6ddc30570709ea8a4b52aa7
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 21 12:00:17 2013 -0600

    Fixed some scalar types in BLAS-like Herm APIs.
    
    Details:
    - Some of the scalars of Hermitian operations, such as alpha in her,
      alpha and beta in herk, and beta in her2k, need to be real. These
      arguments were typed incorrectly as the complex types. This has been
      fixed. Note the issue was only present in the BLAS-like APIs for
      these operations (not the native object-based interfaces).

commit 5ece050a669e74ba4a711d1d4669239d22d45642
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 20 15:50:54 2013 -0600

    Updated version file. (Forgot to in prev commit).

commit f243034b8b430d4684680ea8eddfd246e73fefc0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 20 14:11:36 2013 -0600

    Changed API of packm_init_pack() to use blksz_t.
    
    Details:
    - Changed the interface of packm_init_pack() so that mult_m and mult_n
      are passed in as type blksz_t* instead of dim_t.
    - Make similar change for packv_init_pack().

commit da0c22f24107be9f33e0ea2dae52e5534b1fd0e5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 15 09:59:48 2013 -0600

    Minor changes to lower levels of scalm and setm.
    
    Details:
    - Removed diagx parameter from lower-level interfaces of scalm.
    - Modified scalm_basic_check() to expect an object with a nonunit diagonal.
    - Changed setm_unb_var1() so that having an implicit unit diagonal results
      in only the strictly lower or upper triangle of the matrix being modified.

commit 2c836adadcd2a7d7f217033ac4d7fcad03d5bd55
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 14 10:42:56 2013 -0600

    Updated beta == zero semantics of mulsc.
    
    Details:
    - Updated beta == zero semantics of mulsc. Hopefully this is the last
      operation that needed updating.
    - Added Devin to CREDITS file.

commit 722b66c7dcaaaa1b109e7c8b1d53fd71a9af8240
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 14 10:18:00 2013 -0600

    Removed some calls to setv() in test modules.
    
    Details:
    - Removed calls to setv() in test modules whose sole purpose was to
      initialize vectors to zero to ensure that nan's and inf's would not
      taint the computation. Now that beta == zero semantics have been
      updated to clear the output operand (when beta is zero), rather than
      multiply against it, these setv() calls are no longer needed.

commit e6ac623a902f776c42f85eadbf76996d9770a0db
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 13 18:44:59 2013 -0600

    Properly implemented beta == 0 semantics.
    
    Details:
    - Changed name of set0 and set0_mxn macros to set0s and set0s_mxn,
      respectively.
    - Added code to the following operations that sets the output operand to
      zero if the corresponding scalar is zero (rather than performing the
      floating-point multiply, or in the case of setv, copying the value).
      This will prevent nan's and inf's from creeping into results from
      uninitialized memory.
      - axpy
      - dotxv
      - scalv
      - scal2v
      - setv
      - gemv
      - ger
      - hemv
      - her
      - her2
      - gemm reference ukernels

commit aedccbc85d491e41711a0c6eb0d246d8700a199a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 13 18:29:53 2013 -0600

    Fixed stale interface to packm_unb_var1().
    
    Details:
    - Removed the control tree from the interface to packm_unb_var1(), which
      I meant to do when it was un-deprecated.

commit c23135669f7a8a545e2e11ef559bf284be8bc65c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 13 13:21:00 2013 -0600

    Un-deprecated packm_unb_var1.c (needed by l2 ops).
    
    Details:
    - Added bl2_packm_unb_var1() back into the mix once I realized that level-2
      operations still need this routine for packing matrices. Now, whether
      level-2 operations should be packing matrices to begin with is another
      matter. But this fixes the segmentation fault one would have gotten when
      running bl2_gemv() on a general stride matrix.

commit cf49e35f9819f9d93ebdca4703ade5abab28f6f6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 12 18:39:35 2013 -0600

    Removed cntl tree usage from packm implementation.
    
    Details:
    - Added new fields to obj_t info field:
      - invert_diag
      - pack_order_if_upper
      - pack_order_if_lower
      These fields allow packm_init() to embed information that begins
      in the control tree into the object so that the packm implementation
      does not need to use control trees at all. This is being done to aid
      Bryan's DxT code generation.
    - Added macros that operate on above fields.
    - Changed packm_init(), packm_blk_var2(), and packm_blk_var3() according
      to above changes.
    - Made similar (but much simpler) changes to packv.
    - Deprecated packm_blk_var1(), packm_unb_var1(), and packm_densify().
      These were part of prototype implementations and are no longer needed.

commit eb139ae256651af7820b93ef982626180195b87f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 12 12:39:30 2013 -0600

    Replaced bl2_abs() with _fabs() where appropriate.

commit 474bac30c99928f9e87315972bcb45c632c0b7ec
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 12 12:23:48 2013 -0600

    Removed level-0 macros projrs, grabis.
    
    Details:
    - Replaced instances of projrs and grabis macros with newer,
      more general-purpose getris.

commit 03a260a457c8964e4603a655cee0d40ac17affba
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 12 11:45:34 2013 -0600

    Restored executable permissions to scripts.
    
    Details:
    - Restored executable (0755) permissions to scripts that were touched by
      the recursive sed script that updated the copyright headers in the
      previous commit.

commit 1274e1243775e5e705114257a43176f63635227f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 11 14:37:47 2013 -0600

    Updated copyright headers from 2012 to 2013.

commit 3b620cc8e90c53c79129bd9dd89ae6b77c2446f1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 11 13:38:07 2013 -0600

    CHANGELOG update.

commit 768fcebaa8be0eb936a6e7a02cd8a19438c79d99 (tag: 0.0.2)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 11 13:20:44 2013 -0600

    Added unified test suite, and many fixes.
    
    Details:
    - Added a highly configurable, unified test suite.
    
    - Removed DUPB configuration constant from bl2_kernel.h and macro-kernel
      header files. Now, instead, DUPB is computed as (NDUP != 1) within each
      macro-kernel. This fixes a bug in trmm/trsm whereby bp was indexed into
      incorrectly when DUPB was set to FALSE but the NDUP was still non-unit.
      By encoding both pieces of information into one constant in _kernel.h,
      it seems somewhat less likely others will encounter this bug in the
      future.
    - Added level-2 cache blocksizes to _kernel.h for reference configuration,
      and defined blocksizes in _cntl.c files to these default values.
    
    - Changed semantics of her2k and syr2k such that these operations no longer
      expect the B matrix to already be conjugate-transposed (or just transposed
      for syr2k). However, these semantics are preserved for the internal
      mechanics of the implementations, including the internal back-end and all
      blocked variants.
    - Inserted checks for real-valued alpha and beta for herk/her2k and herk,
      respectively.
    
    - Relaxed general object structure constraints in _basic_check() for gemv, ger.
    - Changed her front-end to NOT copy-cast to real projection; instead, this is
      replaced by selecting either the real part or both parts within the unblocked
      algorithm implementation, depending on the value of conjh.
    - Added conjh to all _check routines for her so that the code knows when to
      verify that alpha has an imaginary component equal to zero (for her, but
      not syr).
    - Changed control tree for her to forgo packing.
    
    - Added unit diagonal support to fnormm.
    - Redefined real versions of abval2s macros in terms of fabs(), fabsf().
    - Redefined complex versions of sqrt2s macros using the actual "complex square
      root" formula.
    - Created new level-0 object-based routines, suffixed with "sc" (for "scalar").
    - Defined new level-1v, -1d, and -1m versions of add and sub operations
      (two-operand add and subtract).
    - Added new scalar macros:
      - getris: acquire real and imaginary components.
      - setris: set real and imaginary components.
      - addjs: addition with conjugated x.
      - subjs: subtraction with conjugated x.
    - Defined new utility operations:
      - absumv: element-wise sum of absolute values for vector elements.
      - absumm: element-wise sum of absolute values for matrix elements.
      - mkherm: convert existing matrix to Hermitian.
      - mksymm: convert existing matrix to symmetric.
      - mktrim: convert existing matrix to triangular.
    
    - Added various error checking routines.
    - Added bl2_clock_min_diff(), which is used to more cleanly measure the
      wall clock time of a code block.
    - Added general stride support to bl2_obj_alloc_buffer().
    - Added bl2_obj_init_scalar().
    - Updated parameter mapping in bl2_param_map.c.
    - Added support for queriable version string.
    
    - Fixed a bug in the her2k macro-kernels (which currently are simply
      implemented in terms of two invocations of herk) whereby beta was being
      applied to both the first and second rank-k updates, rather than only
      the first.
    - Fixed a bug in trmm/trsm whereby transpose and right side cases were not
      properly implemented due to erroneous assumptions regarding aliasing and
      root objects.
    - Fixed a bug in the upper triangular trsm macro-kernel in which the wrong
      MR x NR block of B was being updated.
    - Fixed a bug in the inverts macro in the double real case whereby the
      value was typecast to float before inversion. This affected non-unit cases
      of dtrsm.
    - Fixed a bug in the reference kernels for gemmtrsm whereby the minus one
      constant was being applied incorrectly.
    - Fixed a bug in the overall treatment of non-unit alpha for trsm. The code
      now mimics the rank-k strategy of gemm, whereby alpha is applied during
      the first iteration of variant 3, with BLIS_ONE passed in instead for
      subsequent iterations. This also required passing alpha into the macro-
      kernels as well as the fused gemmtrsm micro-kernels.
    - Fixed a bug in trsm_u_blk_var1 whereby the gemm macro-kernel was being
      called for blocks strictly above the diagonal. While this sounds good in
      theory, this cannot be done because gemm_ker_var2 expects row panels of
      A to be packed from top to bottom, while for trsm_u, A is actually packed
      from bottom to top due to the reverse (BR->TL) nature of the algorithm.
    - Fixed a bug in packm_cxk() whereby panel packings with unit panel
      dimensions were mishandled due to incorrect arguments to the copyv kernel.
      Also changed the copyv kernel invocation to scal2v so that these edge
      cases are properly handled when scaling is requested.
    - Fixed a bug in packv_int() whereby an uninitialized object was passed in
      instead of the source object.
    - Fixed a bug whereby level-2 code could allocate memory dynamically via
      bl2_malloc() and then attempt to free it via bl2_mm_release(). Also fixed
      a potential future bug whereby a mem_t object that is actually no longer
      "allocated" from the static pool is mistaken for being allocated due to
      failure to NULLify the buffer when the block was most recently released.
    - Fixed a bug in bl2_acquire_mpart_*() whereby the uplo field was mistakenly
      toggled when the requested subpartition needed to be "reflected" due to it
      residing in an unstored region.
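
    Illustration (not part of the original commit): a minimal sketch of the
    kind of precision loss behind the inverts fix listed above, where a
    double real value was typecast to float before inversion. The function
    names are hypothetical; the actual macro is not reproduced here.

```c
/* Buggy form: the operand is narrowed to float before the reciprocal,
   so only about 7 significant decimal digits survive. */
double invert_via_float(double a)
{
    return 1.0 / (double)(float)a;
}

/* Corrected form: the reciprocal is computed entirely in double. */
double invert_double(double a)
{
    return 1.0 / a;
}
```

    For an input such as 0.1, which is not exactly representable in float,
    the two results differ far beyond double-precision rounding error. That
    is enough to degrade a non-unit-diagonal dtrsm solve.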

commit be94fb84c0351602d7585269f29998e3bf83f899
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jan 4 10:55:21 2013 -0600

    Added missing 'd' to fused gemmtrsm function name.

commit 879a179e1dee36f0c56765f2ab91a26861019b34
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jan 4 10:37:27 2013 -0600

    Added debug statements to bl2_mm_acquire_m().
    
    Details:
    - Added printf() statements to bl2_mm_acquire_m() to help debug issues
      with prematurely exhausted memory pool.
    - Removed 'd' from kernel names of reference kernels in clarksville
      configuration's bl2_kernel.h

commit 806e74beb4eafeef620a555ffbb3f6779e29c7b6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 20 17:07:50 2012 -0600

    Defined Frobenius norm operations.
    
    Details:
    - Added level-0 grabis macro operation to grab imaginary component of one
      variable and copy it to the real component of another variable.
    - Defined sumsqv operation, which computes the sum of the absolute squares
      of the elements of a vector. This implementation is modeled after ?lassq
      in netlib LAPACK.
    - Defined fnormv and fnormm operations, which compute the Frobenius norm on
      vectors and matrices, respectively. These operations are treated as one-
      operand operations where the output norm value is the real projection of
      the datatype of the input operand. Both operations are implemented in terms
      of sumsqv.

commit 66e80ce1aec099b2b2b0c4f295e38add2c921383
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 20 17:02:55 2012 -0600

    Added GENT*R macros; tweaked bl2_machval defs.
    
    Details:
    - Added function and prototype macro-generating macros for GENTFUNCR and
      GENTPROTR, which are one-operand macros with auxiliary real projection
      types.
    - Tweaked bl2_machval files to use new macros.
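
    Illustration (not part of the original commit): a minimal sketch of a
    function-generating macro that takes an operand type together with its
    real projection type. All names (SK_GENTFUNCR, sk_scomplex, and so on)
    are hypothetical, and only the two complex instantiations are shown; the
    real GENTFUNCR/GENTPROTR definitions are not reproduced in this log.

```c
/* Hypothetical complex types standing in for the library's own. */
typedef struct { float  real, imag; } sk_scomplex;
typedef struct { double real, imag; } sk_dcomplex;

/* ctype:   operand datatype
   ctype_r: real projection of ctype (the type of the result)
   ch:      one-character type prefix used in the generated name */
#define SK_GENTFUNCR(ctype, ctype_r, ch)                          \
ctype_r sk_##ch##sumsq(int n, const ctype *x)                     \
{                                                                 \
    ctype_r sum = 0;                                              \
    for (int i = 0; i < n; ++i)                                   \
        sum += x[i].real * x[i].real + x[i].imag * x[i].imag;     \
    return sum;                                                   \
}

/* One macro body, two generated functions: sk_csumsq and sk_zsumsq. */
SK_GENTFUNCR(sk_scomplex, float,  c)
SK_GENTFUNCR(sk_dcomplex, double, z)
```

    Passing the real projection as a separate macro argument lets a single
    template express operations whose result is real even when the operand
    is complex.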

commit 2fecc88ca22142020573f168da715e8e9f3dd7de
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 20 11:35:14 2012 -0600

    Fixed harmless macro bug in level-1m operations.
    
    Details:
    - Fixed some inconsistent usage of n_iter_max and n_iter in the two
      bl2_set_dims_incs_uplo_[12]m macros. The right thing ended up happening
      despite the bug, which is why I had not discovered it until now.

commit 8945db6ec9f82168cf72411ad408b4fdb44ae0d1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 18 15:07:36 2012 -0600

    Renamed x86,x86_64 kernels to indicate 'd' fusing.
    
    Details:
    - Renamed x86 and x86_64 kernels to contain a 'd' before the fusing shape
      to emphasize that the fusing shape is not for all datatype instances, but
      rather just for one (that of double-precision real). Other fusing shapes
      would be proportional to their precision and domain "byte footprints".
    - Corresponding changes to config/clarksville/bl2_kernel.h.

commit 6fbbdd4e194d06096ad08c5db61127be338067db
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 18 14:34:02 2012 -0600

    More tweaks to _config.h, _kernel.h; smem tweaks.
    
    Details:
    - Moved kernel-related definitions from bl2_config.h to bl2_kernel.h.
    - Replaced #define of _GNU_SOURCE with #define of _POSIX_C_SOURCE. This
      accomplishes the same thing (enabling posix_memalign()) without enabling
      all of the GNU extensions we don't need.
    - Defined the size of the static memory pool in terms of MC, KC, and NC,
      as well as two new constants that determine how many MCxKC blocks and
      how many KCxNC blocks should be allocated (defined in bl2_config.h).
    - In the case of static memory pool exhaustion, replaced the generic
      bl2_abort() with a specific error code call.
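
    Illustration (not part of the original commit): a minimal sketch of
    sizing a static memory pool from cache blocksizes and per-shape block
    counts. Every name and value below is a hypothetical placeholder, and
    double-precision elements are assumed; the real constants live in
    bl2_config.h and bl2_kernel.h and are not reproduced in this log.

```c
#include <stddef.h>

/* Hypothetical cache blocksizes. */
#define SKETCH_MC 256
#define SKETCH_KC 256
#define SKETCH_NC 4096

/* Hypothetical counts of how many blocks of each shape to reserve. */
#define SKETCH_NUM_MC_X_KC_BLOCKS 1
#define SKETCH_NUM_KC_X_NC_BLOCKS 1

/* Pool size in bytes: room for the requested MCxKC and KCxNC blocks. */
#define SKETCH_POOL_SIZE \
    ( ( (size_t)SKETCH_NUM_MC_X_KC_BLOCKS * SKETCH_MC * SKETCH_KC + \
        (size_t)SKETCH_NUM_KC_X_NC_BLOCKS * SKETCH_KC * SKETCH_NC ) \
      * sizeof(double) )

static char sketch_pool[SKETCH_POOL_SIZE];

/* Report the pool size so callers can detect exhaustion up front. */
size_t sketch_pool_size(void)
{
    return sizeof(sketch_pool);
}
```

    Deriving the size from the blocksizes keeps the pool in step with the
    kernel configuration, so changing MC, KC, or NC does not require a
    separate manual update of the pool dimensions.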

commit 5d8bdb21c48e8fb11bef6128a242122cc1470a99
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 17 16:07:36 2012 -0600

    Minor reordering of bl2_config.h definitions.

commit 4a83f67490136a898f558e273b76a687aed8b893
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 17 12:35:54 2012 -0600

    Consolidated configuration headers.
    
    Details:
    - Merged contents of bl2_arch.h into bl2_config.h for reference and
      clarksville configurations.
    - Updated CREDITS, INSTALL, LICENSE, README files.

commit 0670c33cc14612f636ef09ede4133404ae0af6ba
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Dec 14 12:45:26 2012 -0600

    Fixed bug in reference gemm ukernels.
    
    Details:
    - Fixed a bug whereby, for the reference gemm ukernels, the matrix product
      was not correctly accumulated and scaled (by alpha) into the output matrix
      C. (Thanks to Fran for finding this bug.)
    - Whitespace changes to reference trsm kernels.

commit e2e7cb2fbe615be4d375bc2dce88d03d98fadc9e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 13 18:17:54 2012 -0600

    Expanded reference packm/unpackm kernel set to 16.
    
    Details:
    - Added 10xk, 12xk, 14xk, and 16xk reference kernels for packm and
      unpackm.
    - Updated bl2_[un]packm_cxk() to silently use scal2m if "out of range"
      kernel size is requested. (Thanks to Tyler for finding this bug.)
    - Updated bl2_kernel.h to contain new _KERNEL definitions, according
      to above changes, for 'reference' and 'clarksville' configurations.
    - Updated CHANGELOG.
    - Removed "output*.m" from .gitignore.

commit 17455a8bce038dd570356ab0c5c11d9a89f20248
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 10 17:23:32 2012 -0600

    Minor updates towards 0.0.1.

commit 7ad4ebef38b8e6eea9b6091844ba7294ec870271 (tag: 0.0.1)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 10 16:18:40 2012 -0600

    Tweaks to get BLIS compiling again on clarksville.
    
    Details:
    - Updated header files and make_defs.mk in config/clarksville.
    - Fixes to bl2_mem.c (now that SMEM_M, SMEM_N are gone).
    - Moved definition of blksz_t from bl2_cntl.h to bl2_type_defs.h.
    - Shuffled include statements in blis2.h.

commit cc58ea86010b1f046134d13b546c878389df9af5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 10 14:55:12 2012 -0600

    Added template fragment.mk; updated .gitignore.

commit 714c527b0eb153b7e2040b79349edc8372f743fd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Dec 7 19:54:04 2012 -0600

    Added 'changelog' make target; other tweaks.
    
    Details:
    - Updated CHANGELOG.
    - Added 'changelog' target to Makefile that runs 'git log --decorate' and
      overwrites CHANGELOG with the output.
    - Other trivial changes.

commit e4e5404d26aded4873278e85faf6f14ac32115b5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Dec 7 17:34:53 2012 -0600

    Define static memory pool size in bl2_config.h.

commit 19bb507d0de6a2bd3ce37cf616bdcd6b419ed641
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Dec 7 17:18:00 2012 -0600

    Refined INSTALL text; added 'showconfig' target.
    
    Details:
    - Added 'showconfig' target to Makefile.
    - Added header files and ./config/<configname>/make_defs.mk as prerequisites
      to object file rules.
    - Added config.mk as prerequisite to library install rules.
    - Edited and added to INSTALL file.

commit 26cb659dd79636489db5a051aa60fff80273a7b9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 6 15:34:53 2012 -0600

    Added auto-detection of version string (via git).
    
    Details:
    - Added build/update-version-file.sh script for auto-detecting "version"
      string and updating 'version' file accordingly. (If .git directory is
      not present, then it is assumed this copy of BLIS is a downloaded
      release, in which case 'version' file is left unchanged.)
    - Added invocation of update-version-file.sh to configure script.

commit b0ecd0ff52fa6ffc9e1d9eb44c365f7f009a6204
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 6 14:27:11 2012 -0600

    Wrote first draft of INSTALL file.

commit bcbe81235a35ccfdbcc2f2319a0ca6e04f75a785 (tag: 0.0.0)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 6 12:42:35 2012 -0600

    Updated standalone test Makefile and other fixes.
    
    Details:
    - Major edits to test/Makefile to bring up-to-date wrt new build system;
      should no longer be broken.
    - Minor edits to top-level Makefile.
    - Fixed copy-and-paste bugs in
      - frame/1m/packm/ukernels/bl2_packm_ref_?xk.c
      - frame/1m/unpackm/ukernels/bl2_unpackm_ref_?xk.c

commit 2f272b40f43307909736327f49d17737c7a05d37
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 4 19:22:14 2012 -0600

    Added build system and continued reorganization.
    
    Details:
    - Added/renamed packm, unpackm kernels.
    - Added machine value routines.
    - Added param_map facility.
    - Renamed AUTHORS to CREDITS.
    - Added Makefile; continued to expand upon existing configure script.
    - #define fuse_fac macros in operation headers if not defined already
      (by the user in bl2_kernels.h).

commit 00f3498a8943be1b387f0d5c029c8c7891687ad5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 3 12:36:11 2012 -0600

    Initial commit.
