From a7d7d2e1a07e3811dc49af2962c940fd8bbb6c8f Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 27 Jan 2012 14:12:32 -0300 Subject: edac: Create a dimm struct and move the labels into it MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The way a DIMM is currently represented implies that they're linked into a per-csrow struct. However, some drivers don't see csrows, as they're ridden behind some chip like the AMB's on FBDIMM's, for example. This forced drivers to fake^Wvirtualize a csrow struct, and to create a mess under csrow/channel original's concept. Move the DIMM labels into a per-DIMM struct, and add there the real location of the socket, in terms of csrow/channel. Latter patches will modify the location to properly represent the memory architecture. All other drivers will use a per-csrow type of location. Some of those drivers will require a latter conversion, as they also fake the csrows internally. TODO: While this patch doesn't change the existing behavior, on csrows-based memory controllers, a csrow/channel pair points to a memory rank. There's a known bug at the EDAC core that allows having different labels for the same DIMM, if it has more than one rank. A latter patch is need to merge the several ranks for a DIMM into the same dimm_info struct, in order to avoid having different labels for the same DIMM. The edac_mc_alloc() will now contain a per-dimm initialization loop that will be changed by latter patches in order to match other types of memory architectures. Reviewed-by: Aristeu Rozanski Reviewed-by: Borislav Petkov Cc: Doug Thompson Cc: Ranganathan Desikan Cc: "Arvind R." Cc: "Niklas Söderlund" Signed-off-by: Mauro Carvalho Chehab --- include/linux/edac.h | 28 +++++++++++++++++++++++----- 1 file changed, 23 insertions(+), 5 deletions(-) (limited to 'include') diff --git a/include/linux/edac.h b/include/linux/edac.h index c621d762bb2c..52bceca85e63 100644 --- a/include/linux/edac.h +++ b/include/linux/edac.h @@ -312,23 +312,34 @@ enum scrub_type { * PS - I enjoyed writing all that about as much as you enjoyed reading it. */ +/* FIXME: add a per-dimm ce error count */ +struct dimm_info { + char label[EDAC_MC_LABEL_LEN + 1]; /* DIMM label on motherboard */ + unsigned memory_controller; + unsigned csrow; + unsigned csrow_channel; +}; + /** * struct rank_info - contains the information for one DIMM rank * * @chan_idx: channel number where the rank is (typically, 0 or 1) * @ce_count: number of correctable errors for this rank - * @label: DIMM label. Different ranks for the same DIMM should be - * filled, on userspace, with the same label. - * FIXME: The core currently won't enforce it. * @csrow: A pointer to the chip select row structure (the parent * structure). The location of the rank is given by * the (csrow->csrow_idx, chan_idx) vector. + * @dimm: A pointer to the DIMM structure, where the DIMM label + * information is stored. + * + * FIXME: Currently, the EDAC core model will assume one DIMM per rank. + * This is a bad assumption, but it makes this patch easier. Later + * patches in this series will fix this issue. */ struct rank_info { int chan_idx; u32 ce_count; - char label[EDAC_MC_LABEL_LEN + 1]; - struct csrow_info *csrow; /* the parent */ + struct csrow_info *csrow; + struct dimm_info *dimm; }; struct csrow_info { @@ -428,6 +439,13 @@ struct mem_ctl_info { int mc_idx; int nr_csrows; struct csrow_info *csrows; + + /* + * DIMM info. Will eventually remove the entire csrows_info some day + */ + unsigned nr_dimms; + struct dimm_info *dimms; + /* * FIXME - what about controllers on other busses? - IDs must be * unique. dev pointer should be sufficiently unique, but -- cgit v1.2.3 From 084a4fccef39ac7abb039511f32380f28d0b67e6 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 27 Jan 2012 18:38:08 -0300 Subject: edac: move dimm properties to struct dimm_info MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit On systems based on chip select rows, all channels need to use memories with the same properties, otherwise the memories on channels A and B won't be recognized. However, such assumption is not true for all types of memory controllers. Controllers for FB-DIMM's don't have such requirements. Also, modern Intel controllers seem to be capable of handling such differences. So, we need to get rid of storing the DIMM information into a per-csrow data, storing it, instead at the right place. The first step is to move grain, mtype, dtype and edac_mode to the per-dimm struct. Reviewed-by: Aristeu Rozanski Reviewed-by: Borislav Petkov Acked-by: Chris Metcalf Cc: Doug Thompson Cc: Borislav Petkov Cc: Mark Gross Cc: Jason Uhlenkott Cc: Tim Small Cc: Ranganathan Desikan Cc: "Arvind R." Cc: Olof Johansson Cc: Egor Martovetsky Cc: Michal Marek Cc: Jiri Kosina Cc: Joe Perches Cc: Dmitry Eremin-Solenikov Cc: Benjamin Herrenschmidt Cc: Hitoshi Mitake Cc: Andrew Morton Cc: James Bottomley Cc: "Niklas Söderlund" Cc: Shaohui Xie Cc: Josh Boyer Cc: Mike Williams Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Mauro Carvalho Chehab --- include/linux/edac.h | 21 +++++++++++++-------- 1 file changed, 13 insertions(+), 8 deletions(-) (limited to 'include') diff --git a/include/linux/edac.h b/include/linux/edac.h index 52bceca85e63..87aa07d2ee28 100644 --- a/include/linux/edac.h +++ b/include/linux/edac.h @@ -318,6 +318,13 @@ struct dimm_info { unsigned memory_controller; unsigned csrow; unsigned csrow_channel; + + u32 grain; /* granularity of reported error in bytes */ + enum dev_type dtype; /* memory device type */ + enum mem_type mtype; /* memory dimm type */ + enum edac_type edac_mode; /* EDAC mode for this dimm */ + + u32 ce_count; /* Correctable Errors for this dimm */ }; /** @@ -343,19 +350,17 @@ struct rank_info { }; struct csrow_info { - unsigned long first_page; /* first page number in dimm */ - unsigned long last_page; /* last page number in dimm */ + unsigned long first_page; /* first page number in csrow */ + unsigned long last_page; /* last page number in csrow */ + u32 nr_pages; /* number of pages in csrow */ unsigned long page_mask; /* used for interleaving - * 0UL for non intlv */ - u32 nr_pages; /* number of pages in csrow */ - u32 grain; /* granularity of reported error in bytes */ - int csrow_idx; /* the chip-select row */ - enum dev_type dtype; /* memory device type */ + int csrow_idx; /* the chip-select row */ + u32 ue_count; /* Uncorrectable Errors for this csrow */ u32 ce_count; /* Correctable Errors for this csrow */ - enum mem_type mtype; /* memory csrow type */ - enum edac_type edac_mode; /* EDAC mode for this csrow */ + struct mem_ctl_info *mci; /* the parent */ struct kobject kobj; /* sysfs kobject for this csrow */ -- cgit v1.2.3 From a895bf8b1e1ea4c032a8fa8a09475a2ce09fe77a Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Sat, 28 Jan 2012 09:09:38 -0300 Subject: edac: move nr_pages to dimm struct MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The number of pages is a dimm property. Move it to the dimm struct. After this change, it is possible to add sysfs nodes for the DIMM's that will properly represent the DIMM stick properties, including its size. A TODO fix here is to properly represent dual-rank/quad-rank DIMMs when the memory controller represents the memory via chip select rows. Reviewed-by: Aristeu Rozanski Acked-by: Borislav Petkov Acked-by: Chris Metcalf Cc: Doug Thompson Cc: Mark Gross Cc: Jason Uhlenkott Cc: Tim Small Cc: Ranganathan Desikan Cc: "Arvind R." Cc: Olof Johansson Cc: Egor Martovetsky Cc: Michal Marek Cc: Jiri Kosina Cc: Joe Perches Cc: Dmitry Eremin-Solenikov Cc: Benjamin Herrenschmidt Cc: Hitoshi Mitake Cc: Andrew Morton Cc: "Niklas Söderlund" Cc: Shaohui Xie Cc: Josh Boyer Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Mauro Carvalho Chehab --- include/linux/edac.h | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) (limited to 'include') diff --git a/include/linux/edac.h b/include/linux/edac.h index 87aa07d2ee28..67717cab1313 100644 --- a/include/linux/edac.h +++ b/include/linux/edac.h @@ -324,6 +324,8 @@ struct dimm_info { enum mem_type mtype; /* memory dimm type */ enum edac_type edac_mode; /* EDAC mode for this dimm */ + u32 nr_pages; /* number of pages in csrow */ + u32 ce_count; /* Correctable Errors for this dimm */ }; @@ -350,12 +352,12 @@ struct rank_info { }; struct csrow_info { + /* Used only by edac_mc_find_csrow_by_page() */ unsigned long first_page; /* first page number in csrow */ unsigned long last_page; /* last page number in csrow */ - u32 nr_pages; /* number of pages in csrow */ unsigned long page_mask; /* used for interleaving - - * 0UL for non intlv - */ + * 0UL for non intlv */ + int csrow_idx; /* the chip-select row */ u32 ue_count; /* Uncorrectable Errors for this csrow */ -- cgit v1.2.3 From 982216a4290543fe73ae4f0a156f3d7906bd9b73 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 16 Apr 2012 13:04:46 -0300 Subject: edac.h: Add generic layers for describing a memory location The edac core were written with the idea that memory controllers are able to directly access csrows, and that the channels are used inside a csrows select. This is not true for FB-DIMM and RAMBUS memory controllers. Also, some recent advanced memory controllers don't present a per-csrows view. Instead, they view memories as DIMMs, instead of ranks, accessed via csrow/channel. So, changes are needed in order to allow the EDAC core to work with all types of architectures. In preparation for handling non-csrows based memory controllers, add some memory structs and a macro: enum hw_event_mc_err_type: describes the type of error (corrected, uncorrected, fatal) To be used by the new edac_mc_handle_error function; enum edac_mc_layer: describes the type of a given memory architecture layer (branch, channel, slot, csrow). struct edac_mc_layer: describes the properties of a memory layer (type, size, and if the layer will be used on a virtual csrow. EDAC_DIMM_PTR() - as the number of layers can vary from 1 to 3, this macro converts from an address with up to 3 layers into a linear address. Reviewed-by: Borislav Petkov Cc: Doug Thompson Signed-off-by: Mauro Carvalho Chehab --- include/linux/edac.h | 103 ++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 102 insertions(+), 1 deletion(-) (limited to 'include') diff --git a/include/linux/edac.h b/include/linux/edac.h index 67717cab1313..9e628434e164 100644 --- a/include/linux/edac.h +++ b/include/linux/edac.h @@ -70,6 +70,25 @@ enum dev_type { #define DEV_FLAG_X32 BIT(DEV_X32) #define DEV_FLAG_X64 BIT(DEV_X64) +/** + * enum hw_event_mc_err_type - type of the detected error + * + * @HW_EVENT_ERR_CORRECTED: Corrected Error - Indicates that an ECC + * corrected error was detected + * @HW_EVENT_ERR_UNCORRECTED: Uncorrected Error - Indicates an error that + * can't be corrected by ECC, but it is not + * fatal (maybe it is on an unused memory area, + * or the memory controller could recover from + * it for example, by re-trying the operation). + * @HW_EVENT_ERR_FATAL: Fatal Error - Uncorrected error that could not + * be recovered. + */ +enum hw_event_mc_err_type { + HW_EVENT_ERR_CORRECTED, + HW_EVENT_ERR_UNCORRECTED, + HW_EVENT_ERR_FATAL, +}; + /** * enum mem_type - memory types. For a more detailed reference, please see * http://en.wikipedia.org/wiki/DRAM @@ -312,7 +331,89 @@ enum scrub_type { * PS - I enjoyed writing all that about as much as you enjoyed reading it. */ -/* FIXME: add a per-dimm ce error count */ +/** + * enum edac_mc_layer - memory controller hierarchy layer + * + * @EDAC_MC_LAYER_BRANCH: memory layer is named "branch" + * @EDAC_MC_LAYER_CHANNEL: memory layer is named "channel" + * @EDAC_MC_LAYER_SLOT: memory layer is named "slot" + * @EDAC_MC_LAYER_CHIP_SELECT: memory layer is named "chip select" + * + * This enum is used by the drivers to tell edac_mc_sysfs what name should + * be used when describing a memory stick location. + */ +enum edac_mc_layer_type { + EDAC_MC_LAYER_BRANCH, + EDAC_MC_LAYER_CHANNEL, + EDAC_MC_LAYER_SLOT, + EDAC_MC_LAYER_CHIP_SELECT, +}; + +/** + * struct edac_mc_layer - describes the memory controller hierarchy + * @layer: layer type + * @size: number of components per layer. For example, + * if the channel layer has two channels, size = 2 + * @is_virt_csrow: This layer is part of the "csrow" when old API + * compatibility mode is enabled. Otherwise, it is + * a channel + */ +struct edac_mc_layer { + enum edac_mc_layer_type type; + unsigned size; + bool is_virt_csrow; +}; + +/* + * Maximum number of layers used by the memory controller to uniquely + * identify a single memory stick. + * NOTE: Changing this constant requires not only to change the constant + * below, but also to change the existing code at the core, as there are + * some code there that are optimized for 3 layers. + */ +#define EDAC_MAX_LAYERS 3 + +/** + * EDAC_DIMM_PTR - Macro responsible to find a pointer inside a pointer array + * for the element given by [layer0,layer1,layer2] position + * + * @layers: a struct edac_mc_layer array, describing how many elements + * were allocated for each layer + * @var: name of the var where we want to get the pointer + * (like mci->dimms) + * @n_layers: Number of layers at the @layers array + * @layer0: layer0 position + * @layer1: layer1 position. Unused if n_layers < 2 + * @layer2: layer2 position. Unused if n_layers < 3 + * + * For 1 layer, this macro returns &var[layer0] + * For 2 layers, this macro is similar to allocate a bi-dimensional array + * and to return "&var[layer0][layer1]" + * For 3 layers, this macro is similar to allocate a tri-dimensional array + * and to return "&var[layer0][layer1][layer2]" + * + * A loop could be used here to make it more generic, but, as we only have + * 3 layers, this is a little faster. + * By design, layers can never be 0 or more than 3. If that ever happens, + * a NULL is returned, causing an OOPS during the memory allocation routine, + * with would point to the developer that he's doing something wrong. + */ +#define EDAC_DIMM_PTR(layers, var, nlayers, layer0, layer1, layer2) ({ \ + typeof(var) __p; \ + if ((nlayers) == 1) \ + __p = &var[layer0]; \ + else if ((nlayers) == 2) \ + __p = &var[(layer1) + ((layers[1]).size * (layer0))]; \ + else if ((nlayers) == 3) \ + __p = &var[(layer2) + ((layers[2]).size * ((layer1) + \ + ((layers[1]).size * (layer0))))]; \ + else \ + __p = NULL; \ + __p; \ +}) + + +/* FIXME: add the proper per-location error counts */ struct dimm_info { char label[EDAC_MC_LABEL_LEN + 1]; /* DIMM label on motherboard */ unsigned memory_controller; -- cgit v1.2.3 From 4275be63559719c3149b19751029f1b0f1b26775 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Wed, 18 Apr 2012 15:20:50 -0300 Subject: edac: Change internal representation to work with layers MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Change the EDAC internal representation to work with non-csrow based memory controllers. There are lots of those memory controllers nowadays, and more are coming. So, the EDAC internal representation needs to be changed, in order to work with those memory controllers, while preserving backward compatibility with the old ones. The edac core was written with the idea that memory controllers are able to directly access csrows. This is not true for FB-DIMM and RAMBUS memory controllers. Also, some recent advanced memory controllers don't present a per-csrows view. Instead, they view memories as DIMMs, instead of ranks. So, change the allocation and error report routines to allow them to work with all types of architectures. This will allow the removal of several hacks with FB-DIMM and RAMBUS memory controllers. Also, several tests were done on different platforms using different x86 drivers. TODO: a multi-rank DIMMs are currently represented by multiple DIMM entries in struct dimm_info. That means that changing a label for one rank won't change the same label for the other ranks at the same DIMM. This bug is present since the beginning of the EDAC, so it is not a big deal. However, on several drivers, it is possible to fix this issue, but it should be a per-driver fix, as the csrow => DIMM arrangement may not be equal for all. So, don't try to fix it here yet. I tried to make this patch as short as possible, preceding it with several other patches that simplified the logic here. Yet, as the internal API changes, all drivers need changes. The changes are generally bigger in the drivers for FB-DIMMs. Cc: Aristeu Rozanski Cc: Doug Thompson Cc: Borislav Petkov Cc: Mark Gross Cc: Jason Uhlenkott Cc: Tim Small Cc: Ranganathan Desikan Cc: "Arvind R." Cc: Olof Johansson Cc: Egor Martovetsky Cc: Chris Metcalf Cc: Michal Marek Cc: Jiri Kosina Cc: Joe Perches Cc: Dmitry Eremin-Solenikov Cc: Benjamin Herrenschmidt Cc: Hitoshi Mitake Cc: Andrew Morton Cc: "Niklas Söderlund" Cc: Shaohui Xie Cc: Josh Boyer Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Mauro Carvalho Chehab --- include/linux/edac.h | 38 +++++++++++++++++++++++++------------- 1 file changed, 25 insertions(+), 13 deletions(-) (limited to 'include') diff --git a/include/linux/edac.h b/include/linux/edac.h index 9e628434e164..d68b01cad068 100644 --- a/include/linux/edac.h +++ b/include/linux/edac.h @@ -416,18 +416,20 @@ struct edac_mc_layer { /* FIXME: add the proper per-location error counts */ struct dimm_info { char label[EDAC_MC_LABEL_LEN + 1]; /* DIMM label on motherboard */ - unsigned memory_controller; - unsigned csrow; - unsigned csrow_channel; + + /* Memory location data */ + unsigned location[EDAC_MAX_LAYERS]; + + struct mem_ctl_info *mci; /* the parent */ u32 grain; /* granularity of reported error in bytes */ enum dev_type dtype; /* memory device type */ enum mem_type mtype; /* memory dimm type */ enum edac_type edac_mode; /* EDAC mode for this dimm */ - u32 nr_pages; /* number of pages in csrow */ + u32 nr_pages; /* number of pages on this dimm */ - u32 ce_count; /* Correctable Errors for this dimm */ + unsigned csrow, cschannel; /* Points to the old API data */ }; /** @@ -447,9 +449,10 @@ struct dimm_info { */ struct rank_info { int chan_idx; - u32 ce_count; struct csrow_info *csrow; struct dimm_info *dimm; + + u32 ce_count; /* Correctable Errors for this csrow */ }; struct csrow_info { @@ -545,13 +548,18 @@ struct mem_ctl_info { unsigned long (*ctl_page_to_phys) (struct mem_ctl_info * mci, unsigned long page); int mc_idx; - int nr_csrows; struct csrow_info *csrows; + unsigned nr_csrows, num_cschannel; + + /* Memory Controller hierarchy */ + unsigned n_layers; + struct edac_mc_layer *layers; + bool mem_is_per_rank; /* * DIMM info. Will eventually remove the entire csrows_info some day */ - unsigned nr_dimms; + unsigned tot_dimms; struct dimm_info *dimms; /* @@ -566,12 +574,16 @@ struct mem_ctl_info { const char *dev_name; char proc_name[MC_PROC_NAME_MAX_LEN + 1]; void *pvt_info; - u32 ue_noinfo_count; /* Uncorrectable Errors w/o info */ - u32 ce_noinfo_count; /* Correctable Errors w/o info */ - u32 ue_count; /* Total Uncorrectable Errors for this MC */ - u32 ce_count; /* Total Correctable Errors for this MC */ unsigned long start_time; /* mci load start time (in jiffies) */ + /* + * drivers shouldn't access those fields directly, as the core + * already handles that. + */ + u32 ce_noinfo_count, ue_noinfo_count; + u32 ue_count, ce_count; + u32 *ce_per_layer[EDAC_MAX_LAYERS], *ue_per_layer[EDAC_MAX_LAYERS]; + struct completion complete; /* edac sysfs device control */ @@ -584,7 +596,7 @@ struct mem_ctl_info { * by the low level driver. * * Set by the low level driver to provide attributes at the - * controller level, same level as 'ue_count' and 'ce_count' above. + * controller level. * An array of structures, NULL terminated * * If attributes are desired, then set to array of attributes -- cgit v1.2.3 From 5926ff502f6b93ca0c1654f8a5c5317ea236dbdb Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 9 Feb 2012 11:05:20 -0300 Subject: edac: Initialize the dimm label with the known information While userspace doesn't fill the dimm labels, add there the dimm location, as described by the used memory model. This could eventually match what is described at the dmidecode, making easier for people to identify the memory. For example, on an Intel motherboard where the DMI table is reliable, the first memory stick is described as: Memory Device Array Handle: 0x0029 Error Information Handle: Not Provided Total Width: 64 bits Data Width: 64 bits Size: 2048 MB Form Factor: DIMM Set: 1 Locator: A1_DIMM0 Bank Locator: A1_Node0_Channel0_Dimm0 Type: Type Detail: Synchronous Speed: 800 MHz Manufacturer: A1_Manufacturer0 Serial Number: A1_SerNum0 Asset Tag: A1_AssetTagNum0 Part Number: A1_PartNum0 The memory named as "A1_DIMM0" is physically located at the first memory controller (node 0), at channel 0, dimm slot 0. After this patch, the memory label will be filled with: /sys/devices/system/edac/mc/csrow0/ch0_dimm_label:mc#0channel#0slot#0 And (after the new EDAC API patches) as: /sys/devices/system/edac/mc/mc0/dimm0/dimm_label:mc#0channel#0slot#0 So, even if the memory label is not initialized on userspace, an useful information with the error location is filled there, expecially since several systems/motherboards are provided with enough info to map from channel/slot (or branch/channel/slot) into the DIMM label. So, letting the EDAC core fill it by default is a good thing. It should noticed that, as the label filling happens at the edac_mc_alloc(), drivers can override it to better describe the memories (and some actually do it). Cc: Aristeu Rozanski Cc: Doug Thompson Signed-off-by: Mauro Carvalho Chehab --- include/linux/edac.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include') diff --git a/include/linux/edac.h b/include/linux/edac.h index d68b01cad068..91ba3bae42ee 100644 --- a/include/linux/edac.h +++ b/include/linux/edac.h @@ -581,7 +581,7 @@ struct mem_ctl_info { * already handles that. */ u32 ce_noinfo_count, ue_noinfo_count; - u32 ue_count, ce_count; + u32 ue_mc, ce_mc; u32 *ce_per_layer[EDAC_MAX_LAYERS], *ue_per_layer[EDAC_MAX_LAYERS]; struct completion complete; -- cgit v1.2.3