
When Solid State Drives are not that solid

It looked like just another page in the middle of the night. One of the servers of our search API had stopped processing indexing jobs for an unknown reason. Since we build our systems at Algolia for high availability and resiliency, nothing bad was happening: new API calls were correctly redirected to the rest of the healthy machines in the cluster, and the only impact on the service was one woken-up engineer. It was time to find out what was going on.

SUMMARY:
1) The issue reported by Algolia is caused by a Linux kernel bug, not by the SSDs themselves
2) The kernel bug can affect any SSD under the same operating conditions
3) Samsung has published a Linux kernel patch that should fix the issue

UPDATE June 16:
A lot of the discussion pointed out that the issue is related to the newly introduced queued TRIM. This is not correct. The TRIM on our drives is unqueued, and the issue we found is not related to the recent changes in the Linux kernel that disable queued TRIM on some drives. For reference, here is smartctl showing that log page 0x13, which advertises queued-TRIM capability, does not even exist on our drives:

# smartctl -l gplog,0x13 /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.16.0-31-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

General Purpose Log 0x13 does not exist (override with '-T permissive' option)

UPDATE June 17:
Samsung contacted us, and we provided them with the full system specifications and all the information about the issue we had. We will continue to provide Samsung with whatever they need in order to resolve the issue.

UPDATE June 18:
We just had a conference call with the European branch and the Korean HQ of Samsung. Their engineers are going to visit one of the datacenters we have servers in and, in cooperation with our server provider, inspect the affected SSDs in our software and hardware setup.

UPDATE June 19:
On Monday, June 22, an engineering team from Samsung will analyze one of our servers in Singapore. If nothing is found on-site, the server will travel to Samsung HQ in Korea for further analysis.

UPDATE July 13:
Since the last update of this blog post, we have been cooperating with Samsung to help them find the issue. During this investigation, we agreed with Samsung not to communicate publicly until they approved.

As the issue was not reproduced on our server in Singapore, the reproduction is now running under Samsung supervision in Korea, outside of our environment. Although Samsung requested access to our software and corrupted data multiple times, we could not provide it, in order to protect the privacy and data of our customers.

Samsung asked us to inform you about this:

  • Samsung tried to duplicate the failure with the latest script provided to them, but not a single failure has been reproduced so far.
  • Samsung will do further tests, most likely from week 29 onwards, with a much more intensive script provided by Algolia.

After the unsuccessful attempts to reproduce the issue with Bash scripts, we decided to help them by creating a small C++ program that simulates the writing style and pattern of our application (no files are opened with O_DIRECT). We believe that if the issue comes from a specific way in which we use the standard kernel calls, it might take a couple of days and terabytes of data written to the drive to trigger it. Samsung has informed us that no issue of this kind has ever been reported to them. Our server provider has modified their Ubuntu 14.04 images to disable the fstrim cron in order to avoid this issue. In the couple of months since we stopped using TRIM, we have not seen the issue again.

UPDATE July 17:
We have just finished a conference call with Samsung concerning the failure analysis of this issue. Samsung's engineering team has been able to successfully reproduce the issue with the latest binary we provided.

Samsung's concrete conclusion is that the issue is not related to Samsung SSDs or to Algolia software, but to the Linux kernel.

Samsung has developed a kernel patch to resolve this issue, and the official statement with details, along with the patch, will be released to the Linux community tomorrow, July 18. Our testing code is available on GitHub.

This has been an amazing ride. Thank you everyone for joining – we have arrived at the destination.

For all followers of this blogpost and all the new readers:

The discovered issue has a much bigger impact than we originally expected, and it is not caused by Samsung SSDs as we originally assumed.
My personal apologies to Samsung!

The NGINX daemon serving all of the HTTP(S) communication of our API was up and ready to serve search queries, but the indexing process had crashed. Since the indexing process is guarded by supervise, crashing in a loop would have been understandable, but a complete crash was not. As it turned out, the filesystem was in read-only mode. All right, let's assume it was a cosmic ray 🙂 The filesystem got fixed, the files were restored from another healthy server, and everything looked fine again.

The next day, another server ended up with its filesystem in read-only mode; two hours later, another one, and the next hour another. Something was going on. After restoring the filesystems and the files, it was time for serious analysis, since this was not a one-time thing. At that point, we did a breakdown of the software involved in our storage stack and went through the recent changes.

Investigation & debugging time!

We first asked ourselves whether it could be our own software. Are we using unsafe system calls or processing the data in an unsafe way? Did we incorrectly read and write files in memory before flushing them to disk?

  • Filesystem – Is there a bug in ext4? Can we access the memory space of allocation tables by accident?
  • Mdraid – Is there a bug in mdadm? Did we use an improper configuration?
  • Driver – Does the driver have a bug?
  • SSD – Is the SSD dying? Or even worse, is there a problem with the firmware of the drive?

We even started to bet on where the problem was, and proposed, in exactly this order, the possible culprits going from easy to super-hard.

Going through the storage procedures of our software stack allowed us to set up traps so that, if the problem happened again, we would be able to better isolate the corrupted parts. Looking at every single storage call of our engine gave us enough confidence that the problem was not coming from the way we manipulate the data. Unfortunately.

One hour later, another server was corrupted. This time we took it out of the cluster and started to inspect it bit by bit. Before we fixed the filesystem, we noticed that some pieces of our files were missing (zeroed): the file modification date was unchanged, the size was unchanged, just some parts were filled with zeros. Small files were completely erased. This was weird, so we started to wonder whether our application could accidentally access portions of memory where the OS/filesystem had something mapped, because otherwise our application cannot modify a file without the filesystem noticing. Having our software written in C++ brought a lot of crazy ideas about what could have happened. This turned out to be a dead end, as all of these memory blocks were out of our reach.

So was there an issue in ext4? Going through the kernel changelog looking for ext4-related issues was a terrifying experience. In almost every version we found a fixed bug that could theoretically have impacted us. I have to admit, I slept better before reading the changelog.

We had kernels 3.2, 3.10, 3.13 and 3.16 distributed among the most often corrupted machines, and waited to see which of these mines would blow up. All of them did. Another dead end. Maybe it was an issue in ext4 that no one else had seen before? The chance that we were this “lucky” was quite low, and we did not want to end up relying on it. The possibility of a bug in ext4 remained open, but highly improbable.

What if there was an issue in mdadm? Looking at the changelog gave us confidence that we should not go down this path.

The level of despair was reaching critical, and the pages in the middle of the night were unstoppable. We spent a big portion of two weeks just isolating corrupted machines as quickly as possible and restoring them as quickly as possible. The one proactive thing we did was to implement a check in our software that looked for zeroed blocks in the index files, even in parts that were not used, and alerted us in advance.
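That check is simple to sketch. The snippet below is an illustrative stand-in, not our actual engine code: it scans a file in 512-byte blocks and reports the offset of every block that is entirely zeroed.

```python
# Illustrative zeroed-block scanner, similar in spirit to the check we
# added to our engine (names and structure here are assumptions, not our
# production code). It walks a file in 512-byte blocks -- the unit in
# which the corruption appeared -- and reports fully-zeroed ones.
import sys

BLOCK_SIZE = 512
ZERO_BLOCK = b"\x00" * BLOCK_SIZE

def find_zeroed_blocks(path):
    """Return the byte offsets of every block that contains only zeroes."""
    offsets = []
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            # A short block at EOF is compared against zeroes of its own length.
            if block == ZERO_BLOCK[:len(block)]:
                offsets.append(offset)
            offset += len(block)
    return offsets

if __name__ == "__main__":
    for path in sys.argv[1:]:
        for off in find_zeroed_blocks(path):
            print(f"{path}: zeroed 512-byte block at offset {off}")
```

A legitimately empty region would also be flagged, which was fine for us: any fully zeroed block inside an index file was worth an alert.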

Not a single day without corruptions

While more and more machines were dying, we managed to automate the restore procedure to a level we were comfortable with. At every failure, we looked for different patterns in the corruption, in the hope of finding the smallest common denominator. They all had the same characteristics. But one thing became more and more clear: we saw the issue only on a portion of our servers. The software stack was identical, but the hardware was slightly different. Mainly, the SSDs varied, but the affected ones were all from the same manufacturer. This was very alarming and led us to contact our server provider to ask if they had ever seen something like this before. It's hard to convince a technical support person of a problem that you see only once in a while, that occurs on the latest firmware, and that you cannot reproduce on demand. We were not very successful, but at least we had one small victory on our side.

Knowing that the issue existed somewhere in the combination of the software and the drive itself, we reproduced the identical software stack from our servers on different drives. And? Nothing, the corruption did not appear. So it was quite safe to assume the problem was not in the software stack and was drive related. But what causes a block to change its content without the rest of the system noticing? That would be a lot of rotten bits in a sequence…

The days became a routine: long shower, breakfast, restoring corrupted servers, lunch, restoring corrupted servers, dinner, restoring corrupted servers. Until one long morning shower full of thinking: how big was the sequence? As it turned out, the lost data was always 512 bytes, which is one block on the drive. One step further: a whole block ends up full of zeroes. A hardware bug? Or was the block zeroed? What can zero a block? TRIM! TRIM instructs the SSD to zero the empty blocks. But these blocks were not empty, and other types of SSDs were not impacted. We gave it a try and disabled TRIM across all of our servers. It would explain everything!

The next day, not a single server was corrupted. Two days of silence, then a week. The nightmare was over! At least, so we thought… A month after we isolated the problem, a server restarted and came up with corrupted data, but only in the small files – including certificates. Not even an improper shutdown can cause this.

Poking around in the kernel source code looking for the TRIM-related code, we came across the TRIM blacklist. This blacklist configures specific behavior for certain SSD drives, identifying the drives by a pattern match on the model name. Our working SSDs were explicitly allowed full TRIM operation, but some of the SSDs of our affected manufacturer were limited. Our affected drives did not match any pattern, so they were implicitly allowed full TRIM operation.
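The mechanism can be illustrated in a few lines: the kernel matches the drive's reported model string against glob-style patterns and applies per-entry restrictions. The entries and flag names below are illustrative assumptions, not a copy of the real kernel table.

```python
# Sketch of how the kernel's TRIM blacklist behaves: each entry pairs a
# glob-style pattern on the drive's model string with a restriction flag.
# The entries and flag names are illustrative, not the real kernel table.
from fnmatch import fnmatch

BLACKLIST = [
    ("Micron_M500*", "NO_QUEUED_TRIM"),
    ("Crucial_CT???M500SSD*", "NO_QUEUED_TRIM"),
]

def trim_restrictions(model):
    """Return the restriction flags whose pattern matches the model name."""
    return [flag for pattern, flag in BLACKLIST if fnmatch(model, pattern)]

# Our affected drives matched no pattern, so full TRIM was implicitly allowed:
print(trim_restrictions("SAMSUNG MZ7WD480HCGM-00003"))  # prints []
```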

The complete picture

At this moment we finally got the complete picture of what was going on. The system was issuing TRIM to erase empty blocks, the command got misinterpreted by the drive, and the controller erased blocks it was not supposed to. Our files therefore ended up with 512-byte runs of zeroes; files smaller than 512 bytes were completely zeroed. When we were “lucky” enough, the misbehaving TRIM hit the superblock of the filesystem and caused corruption. After disabling TRIM, the live big files were no longer corrupted, but the small files that had been mapped into memory and never touched since then had two states: correct content in memory and corrupted content on the drive. Running a check on the files found nothing, because they were never fetched again from the drive but silently read from memory. A massive reboot of the servers came into play to restore data consistency, and after many weeks of hunting a ghost we came to the end.
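This “correct in memory, corrupted on drive” split is easy to miss precisely because reads are served from the page cache. One way to checksum what is actually on the drive is to evict the file from the cache first – a minimal sketch assuming Linux (posix_fadvise is only advisory, and the drive's internal cache may still serve the read):

```python
# Sketch: hash a file's on-drive content rather than its page-cache copy.
# POSIX_FADV_DONTNEED asks the kernel to drop the cached pages for the
# file, so the following reads have to go back to the device. This is
# advisory only; "echo 3 > /proc/sys/vm/drop_caches" (as root) is the
# heavier hammer, and the SSD's own cache can still get in the way.
import hashlib
import os

def sha256_from_drive(path):
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)  # make sure nothing dirty is pending for this file
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)  # evict cached pages
        h = hashlib.sha256()
        while chunk := os.read(fd, 1 << 16):
            h.update(chunk)
        return h.hexdigest()
    finally:
        os.close(fd)
```

Comparing such a digest against one recorded at write time would have exposed the memory/drive divergence without a reboot.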

As a result, we informed our server provider about the affected SSDs, and they informed the manufacturer. Our new deployments were switched to different SSD drives, and we don't recommend using any SSD that the Linux kernel mentions in a bad way. Also be careful: even if you don't enable TRIM explicitly, at least since Ubuntu 14.04 an explicit FSTRIM runs from cron once per week on all partitions – the freeze of your storage for a couple of seconds will be your smallest problem.

TL;DR

Broken SSDs: (Drives on which we have detected the issue)

  • SAMSUNG MZ7WD480HCGM-00003
  • SAMSUNG MZ7GE480HMHP-00003
  • SAMSUNG MZ7GE240HMGR-00003
  • Samsung SSD 840 PRO Series
    recently blacklisted in the 8-series blacklist
  • Samsung SSD 850 PRO 512GB
    recently blacklisted as 850 Pro, and later in the 8-series blacklist

Working SSDs: (Drives on which we have NOT detected the issue)

  • Intel S3500
  • Intel S3700
  • Intel S3710
Comments

  • Tyler Durden

    Fake!

    • http://www.bluebyte.co.nz leeh007

      Why Fake?

      • leeh0007

        They are going by the name ‘Tyler Durden’, I don’t think you are going to get a logical response.

    • LWWZ

      What’s your source for this opinion?

  • Nick Price

    Using an experimental filesystem and experimental TRIM support on Ubuntu doesn’t seem like the best idea for a variety of reasons.

    There are a number of platforms with more mature TRIM support than Linux (not to mention filesystems that would’ve handled these corruption issues in stride if they did manage to occur). Solaris, SmartOS, and even FreeBSD are good examples.

    Sometimes “Put the latest version of Ubuntu on it” isn’t the correct answer.

    • http://project3825.blogspot.com/ 3825

      Well, no matter the OS kernel, it is Samsung’s fault…

    • Erik Johansson

      So what is it in the TRIM support that makes Solaris, SmartOS and FreeBSD so much better? From experience, you get data corruption with SSDs on Solaris and FreeBSD as well; since it's fixed by swapping to the right SSDs, I've always blamed the controller, either for the UNMAP -> TRIM conversion or just shoddy firmware.

      SSDs firmware are like filesystems on their own.

      • Nebshar

        He’s alluding to the filesystem. ZFS has full checksum coverage. There are no silent corruptions (i.e. you will be able to pinpoint the problem quickly), and you can repair from parity / redundancy with full faith in the restored blocks.

        Solaris, SmartOS, and FreeBSD are not better at TRIM. They are just a lot better at error handling.

        • Benjamin Smith

          PS: ZFS on Linux works rather well for us! However, ZFS would be vulnerable to this TRIM issue if you don't have some type of redundancy, e.g. RAIDZ.

          • Frank Drebin

            ZFS on Linux doesn’t support TRIM/discard, so you won’t run into this problem.

      • J

        If your buggy Linux kernel queues the commands that lead to the unmapping of a block, but sends the TRIM immediately to the SSD controller without expecting it to be executed immediately, then it would not be the controller's fault. Don't jump to conclusions without analysing the exact sequence of commands here.

    • Bob

      EXT4 is experimental? Not for a long time.

      • THE AXEMAN

        Yep, I have no idea what he’s talking about.

        Also, with regards to filesystems (or hell, anything) and the Linux kernel, there’s no reason *not* to have the latest kernel version — especially if you’re using an “experimental” filesystem like btrfs. (which has, by the way, given me no problems whatsoever) Kernel development moves quickly.

        And as @project3825 said: within reason, OP's kernel version or filesystem doesn't matter. Samsung's drives shit the bed. The kernel didn't.

    • PAUL Miller

      ext4 experimental? What are you using? btrfs? ha!

      • Sascha

        minix fs 😉

      • Supremebob

        I only use reiserfs, because it’s a killer filesystem!

    • Carina Wilberg

      You are truly absurdly clueless, buddy. Did you miss the part where the SSD trimmed out regions which the TRIM command did not specify? Don’t go blaming other components when this is clearly a firmware issue.

      • boogaloo

        Nothing in the write-up of the problem conclusively shows that there isn't some form of problem in the kernel/fs interacting with the firmware that causes this. Maybe they both have problems. “There is no problem with different SSDs, so it must be OK” isn't a valid argument. It's obvious that the manufacturer doesn't test Ubuntu, though, so I would blame whoever built the server.

      • Duckeenie

        You wanna share some of that egg?

  • NotTwentyAnymore

    Could you use a font designed for reading when writing more than ten words, please?

    • http://project3825.blogspot.com/ 3825

      Which font do you recommend?

    • michalstanko

      The font is nice and legible, the problem is probably on your side. 😉

      • 7eggert

        It’s “legible” for headlines, but even there it’s a bad font. For longer texts, it’s just bad.

        Fortunately my user-css overrides it to DejaVu Serif.

    • http://kgranger.fr/ Kevin Granger

      Hello,
      we chose the font for its readability, but sometimes shit happens with some untested configuration. Can you share screenshots and your config (OS, browser)? We'll be happy to fix that. Thanks!

  • PAUL Miller

    Then of course you have the issue of bit corruption later down the road, just because of bit rot on the SSD itself. SSDs are good for write-once, read-many. Writing to the same location more than 100k times usually results in bit rot.

  • http://aerospike.com/ Brian Bulkowski

    At Aerospike, we have a NoSQL database that’s been run on a lot of flash drives, for years, and we’ve found a few things:
    * Don’t use TRIM. There are too many bad controllers that (at best) do bizarre performance actions. TRIM should be a good thing…. but it’s (apparently) too complex for most controller writers.
    * Don’t use a file system. Aerospike has its own native data layout, and you can use files, but lousy things happen.
    * Some manufacturers go through bad times, so keep testing. We built a tool http://github.com/aerospike/act so we can prove to manufacturers when they have bugs. If I tried to list all the devices that we tested and had firmware issues with (either performance or “functionality”), it would be a long and embarrassing list. We have over 4 years of data covering a lot of manufacturers.

    • Adam Surak

      I totally agree, you should not use TRIM on enterprise SSDs. The idea of running without a filesystem is interesting, as we are not that far from it. Thank you for the testing tool, it's very impressive! I'll give it a try on our HW. Do you share the results somewhere?

  • Stu

    I’ve had weird read errors from time to time on my Crucial m4 SSD as well… when doing lots of reads (like trying to back up), and it even randomly returned wrong data from files (also when I had done a lot of reads and writes building Python virtualenvs).

    Unfortunately I’m traveling so can’t order a replacement of a different make right now, and it seems to work about 99% of the time, those other times are worrying though.

    • http://gathman.org/vitae CustomDesigned

      Rotating disks store the sector number with each sector. Planning the motion of the R/W head to grab a sector is a complex realtime operation – and sometimes it gets the wrong one. But it knows that immediately, and can try again, doing a seek recalibrate if needed. A momentary freeze is FAR better than silent data corruption.

      You would think SSD manufacturers would do the same. If you ask the drive for sector 123456 and the FTL (Flash Translation Layer) points to sector 7654321, the controller can do a sequential search for the correct sector and fix the corrupted FTL. A freeze during that process is FAR better than silent data corruption.

  • hector

    Have you considered using a SAN instead of raw SSDs? Nimble can provide you with the performance of an SSD and the reliability of hard drives, since it uses SSDs only for caching.

    • Adam Surak

      Since we rent most of our servers, it would be complicated to push the providers in this direction, but for our own setups it would be much easier. Thank you for the tip!

    • Stefan Seidel

      So, I don’t know, but for me, shifting the probability of an error to the *most frequently accessed* blocks doesn’t exactly sound like a win to me.

      Don’t misunderstand me, I like the concept of SSD caching, but for the problem described in *this* blog post, it would have probably made things much worse.

    • Frank Drebin

      Price is always a factor, but you’ll also be working with a black box that you can’t even debug properly.

      • http://gathman.org/vitae CustomDesigned

        My home server running CentOS with Gigabit Ethernet is a SAN. iSCSI is simple to set up. The server has RAID (I prefer RAID-1 or RAID-10), and backups.

        That is great for desktops. But you can’t bring a SAN with you on the road.

        • Frank Drebin

          You can actually cobble together a pretty decent SAN with currently available open source technologies if you so choose. Just combine two storage servers with RAID, DRBD over Infiniband as well as connecting the servers over infiniband (or iSCSI if you must). With iSER you’ll have great performance for exported volumes, but you can also run any other protocol you choose, you’re able to debug and fine tune as you like, deploy using puppet/chef/saltstack/ansible/whatever and you don’t have vendor lock-in. All on commodity hardware. You could even have your cache / filesystem log on NVDIMM.

          Still doesn’t help you when the components are crap though, as many SSD these days are.

  • http://blog.4zal.net/ Karol “Zal” Zalewski

    Consumer-grade Samsung SSDs with the newest firmware declare that they support queued TRIM. The funny thing is that they don't actually do it [1] – they lock up (a power cycle is needed). And now you say that other Samsung disks do a similar thing with un-queued TRIM? Scary…

    [1] https://bugs.launchpad.net/ubuntu/+source/fstrim/+bug/1449005/comments/53

  • Heinz Kurtz

    NCQ causes a lot of issues with a lot of drives, and it wouldn't be the first time Linux blamed drive firmware for its own bugs. I've not seen any performance issues since disabling NCQ globally, so I'd rather not risk it. There should be a stress-test tool that does random read/write/delete/trim and verifies integrity – something like badblocks for TRIM that you could run on each new setup for a day or so before going into production.

    The problem with TRIM is that it's literally everywhere: enabled by default and performed unasked by mkfs and other tools. Some people set issue_discards in lvm.conf because they misunderstand it as merely allowing TRIM, when it actually makes LVM itself wantonly perform TRIM on each lvremove, lvresize, snapshot, etc.

    If possible do TRIM in a cron-job using fstrim or such. If things blow up right after the cron-job ran it’s easier to pinpoint the cause of the trouble.

    • Heinz Kurtz

      Oh, and if you want to run your own tests, beware: Linux cheats. It does not remove trimmed data from its caches. So the only way to verify that data is still intact after a TRIM is to drop the caches to force a re-read from hardware (and hope the hardware does not have its own cache).

  • anon

    BTW, this also sounds similar to the recent mdraid bug with raid0 that was introduced in various stable kernels, where TRIM would actually trim the wrong files/regions, zeroing out random files on the filesystem. Seems like TRIM is pretty dangerous. 🙂 See: https://lkml.org/lkml/2015/5/21/167

    • Adam Surak

      This got introduced in 4.0 afaik. We thought it might be the raid0 issue, but since it did not happen on different drives with an identical software stack, we had to look lower.

  • David

    TRIM or not, since SSDs became available I've always found it better to have an end-to-end Intel solution, i.e. a motherboard with Intel chips and an Intel SSD => because of this, Intel is the only manufacturer around that can implement its hardware properly. Just remember how much OCZ's SSD firmware sucked and how it led to performance issues and data loss.

    • Adam Surak

      I used to work for Intel and if I have a choice, I tend to stay with Intel.

    • Heinz Kurtz

      I’ve had loads of issues with Intel hardware. Intel SSDs had the 8MB bug. Intel mainboards had stability issues and stopped getting updates rather quickly. An Intel CPU (Haswell) became incompatible with Linux after a microcode update (it can’t handle TSX or whatever being disabled on-the-fly). “Implements hardware properly” is a fairy tale, especially when it comes to SSDs. I’ll take Intel CPUs and NICs, but I don’t think they’re immune to any kind of issue.

      • ZeDestructor

        Intel boards: there’s a reason they discontinued the entire lineup of motherboards.

        TSX-NI: they found hardware bugs. Unfixable hardware bugs. So they pushed a microcode update to disable it instead. Any breakage comes either from your programs requiring TSX (in which case, why are you compiling them with TSX requirements?) or elsewhere.

        SSDs: so far, Intel SSDs have been the gold standard of reliability for the consumer lines.

  • Heinz Kurtz

    The “512 bytes” seems very weird to me, by the way. Are your partitions misaligned by 512 bytes? Normally SSDs use much larger page sizes internally (>=4k), so 512 bytes is difficult to get wrong.

    • 7eggert

      If I may guess, the firmware could split and combine sector groups across its blocks, so if you write by “sector”, you end up with a block containing sectors from different files. Then that other file gets erased, and with it the whole block.

      File a: (Data-a, nothing, nothing, nothing, nothing, nothing, nothing, nothing)
      File b: (Data-b, Data-b, Data-b, Data-b, Data-b, nothing, nothing, nothing)

      On disk: (Data-a, Data-b, Data-b, Data-b, Data-b, Data-b, nothing, nothing)
      After trim: (nothing, …)

      • Heinz Kurtz

        Even so, it still seems odd. Would the drive really put the first 512 bytes of a (larger) file into the same block as some random other file? I could understand if it were an entire page (4k, 8k, whatever Samsung uses – it certainly isn’t 512 bytes?), but it just seems fishy somehow.

        Then again, it’s not the first time there have been fishy things in Samsung SSDs… the old-data-gets-slow bug, eh?

        • The guy who was

          Some SSDs have different sector sizes.

        • 7eggert

          I guess they want to be smart. If you write two small files, they’d each use a whole block. But if you put them into one block, you’ve created a spare block. These spare blocks add to the free space and speed up further write accesses just like the blocks you generate using discard().

          But this algorithm isn’t good enough to only collect small files, your database will randomly make it seem as if the writes go to small blocks. On other occasions, the files will grow, the sectors will get fragmented. It’s still a win because of the extra free blocks – unless you get it wrong while trimming the data.

          • Heinz Kurtz

            I still think it’s odd. Also that the engineers have to come to the server, or the server to the engineers. Linux is open source – surely there is a way to make it record all the commands it sends to the SSD and replay them to reproduce the issue?

            That they have to travel about means no one can reproduce it anywhere else, so it may still be something local to their setup (like a faulty controller that shifts data by 512 bytes but not TRIM requests – those SSDs don’t have a Windows alignment jumper, do they?).

            For Samsung it’s a PR matter with all the attention the blog is getting. For users (whose data may or may not be at risk) it’d be better if it turned out to be a false alarm after all. We will see…

    • http://www.algolia.com/ Julien Lemoine

      512 bytes is not weird, it is the default on a lot of devices (mainly for backward compatibility reasons).

      Here is the output of parted on one SM843T SSD:
      Model: ATA SAMSUNG MZ7WD480 (scsi)
      Disk /dev/sdc: 480GB
      Sector size (logical/physical): 512B/512B

      • Heinz Kurtz

        It represents 512-byte sectors to the OS for compatibility reasons, but that has no relation to how it works internally. I can create a single-sector partition on my SSD and TRIM it with blkdiscard, and it will say it trimmed successfully, but even after dropping caches the data is still there – so it didn’t trim anything. The SSD simply does not track erased-ness at 512-byte granularity; it uses pages of 4K or more. The minimum size my SSD trims with the data actually gone is 4K, and those 4K must be aligned.

        512 bytes is a silly unit; it has been too small to be useful for a long time. Even old filesystems like ext2 have a bare minimum of 1K, and if a filesystem is able to squeeze several small files into a single block, that gets a special name…

        Off-by-512-byte errors are weird. More likely it’s the data that’s off by 512 bytes, from partitions that are not aligned. TRIM should not kill the wrong stuff even so (and since misaligned partitions are common, I doubt this is the issue), but who knows where that error happens, as TRIM is handed down through all sorts of storage layers.

        Anyway, looking forward to the solution of the mystery 🙂

        • Frank Drebin

          Yeah, erase blocks are a lot larger than 4k.

  • Spammer

    Hi, this isn’t exactly related, but I’d be grateful if you looked at it, since you seem more qualified. After seeing this article, I thought that maybe my SSD is the cause of the problems with my home computer. It’s a GOODRAM C40 MLC drive with a Phison S3108 controller, but I have also used a few Samsung 840 and 840 Pro drives and had similar issues. It’s very well described here and in the reddit thread linked there: https://answers.launchpad.net/ubuntu/+question/267542 (I provide a lot of information in the comments). A few days ago another thing broke – dpkg doesn’t work: https://bugs.launchpad.net/ubuntu/+source/dpkg/+bug/1464927

    Is it possible that it’s caused by the SSD? Do you know of a way to prove it’s the SSD? Should I buy an Intel one?

  • Yel

    So does this apply to Windows Server? If I am running one on an 840 Pro, should I change ASAP? I’ve had a few random system lockups I can’t figure out – could TRIM cause this?

    • Adam Surak

      Not sure about Windows. Definitely worth testing for consistency of the files and the FS.

    • HenkPoley

      A TRIM bug is a TRIM bug, I guess. You could test it by writing known-good files (e.g. SHA256-hashed) to a sacrificial drive, doing some delete operations, etc. Then check the remaining known files.
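That procedure can be sketched roughly like this (file names, counts and sizes below are arbitrary assumptions): write files with known SHA256 digests and fsync them, and re-hash the survivors after deletions and TRIM have had their chance to do damage.

```python
# Sketch of the TRIM integrity test described above: fill a sacrificial
# filesystem with files of known content, then (after deleting some and
# letting TRIM run) verify that the survivors still match their digests.
# All names and sizes here are illustrative assumptions.
import hashlib
import os

def write_known_files(directory, count=100, size=4096):
    """Create `count` files of deterministic content; return {path: sha256}."""
    expected = {}
    for i in range(count):
        path = os.path.join(directory, f"probe_{i:04d}.bin")
        data = hashlib.sha256(str(i).encode()).digest() * (size // 32)
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the content reaches the drive
        expected[path] = hashlib.sha256(data).hexdigest()
    return expected

def verify_known_files(expected):
    """Return the surviving files whose content no longer matches."""
    corrupted = []
    for path, digest in expected.items():
        if not os.path.exists(path):
            continue  # deliberately deleted to provoke TRIM
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != digest:
                corrupted.append(path)
    return corrupted
```

Between the two steps you would delete some of the files, run fstrim on the mount point, and drop the caches so the verification reads actually hit the device.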

  • Diogo Carvalho

    Completely non-savvy user here, but a very concerned end-user on Mac OS X with TRIM force-enabled via Trim Enabler, on a Fusion Drive with one Samsung 840 PRO 256GB. Am I on the verge of finding my life’s work corrupted? I’ve disabled TRIM now, but my extensive backups may already contain the possibly corrupted, zeroed files. Please advise.

    • Adam Surak

      Hard to say how Mac OS X behaves, and mainly in Fusion drive. Definitely keep backups and if you come to a weird file, take a look for zeroed blocks.

      • Diogo Carvalho

        Thanks for the input, but won’t the possibly zeroed files have propagated into the backups? How do I check for zeroed blocks?

        • Theo Brinkman

          If you use Time Machine for backups, you’ll be able to roll back to versions *before* the zeroed block happened. To *see* the zeroed block, you’ll need to open the file in a hex editor or something similar. A block of 512 bytes worth of ’00’ will stand out pretty easily.
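The hex-editor check described above can also be automated; a minimal sketch (the 512-byte block size follows the comment, and many file formats legitimately contain zero blocks, so hits are a hint, not proof):

```python
def find_zeroed_blocks(path, block_size=512):
    """Return the byte offsets of aligned, fully zeroed blocks in a file."""
    zero = bytes(block_size)
    offsets = []
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(block_size)
            if not block:
                break
            if block == zero:  # a full block of 0x00 bytes
                offsets.append(offset)
            offset += len(block)
    return offsets
```

Running it over a file that should contain no long zero runs (compressed archives, for instance) and getting hits back is a reason to compare that file against a backup.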

    • Heinz Kurtz

      > Am I on the verge of finding my life work corrupted?

      Anyone who does not have backups is on that verge. And Apple’s idea of not enabling TRIM is understandable when it’s a) often buggy and b) not really as important as people make it out to be. SSDs work well without TRIM.

      • You are wrong

        Backups don’t save you from file corruption.

        • Heinz Kurtz

          But they do, unless you’re cracknob enough to delete your old backups in favour of new ones, thereby replacing intact backups with corrupt ones…

          • David C.

            Yes, but how far back do you want to keep them? A few months? A few years? Forever? When a whole system backup can fill a few hard drives, you probably can’t afford to keep several dozen lying around just in case.

          • Heinz Kurtz

            As far back as possible. I prefer incremental backups that just add new/changed files instead of re-backing up the same files over and over again. It’s not going to use more or less storage if you overwrite your old intact copy with a new corrupted one.

            Maybe I’m too old-fashioned though. I still don’t use ZFS/btrfs. 😉

            For small static data (photos) I make extra copies to optical media. At least they survive when lightning hits while your entire HDD zoo is working on those backups…

      • Rob Cadwallader

        I think this is an important distinction that a lot of people don’t understand. TRIM is a hook that allows an OS to invoke garbage collection (usually through the recycle bin/trash can function in the OS). Disabling TRIM in the OS doesn’t disable garbage collection, it simply diverts the garbage collection to the firmware on the SSD controller instead of being invoked at the OS level.

        • David C.

          Not quite. Trim doesn’t trigger garbage collection. TRIM tells the drive firmware that a block is no-longer in use so it can be garbage collected at some later time. The reason it improves performance is that without TRIM, the drive doesn’t know that the block is no longer in-use until the OS wants to overwrite it with new data, so there typically ends up being a lot of garbage that can’t otherwise be collected. This undetectable garbage will be preserved and moved to different physical locations on the SSD as garbage collection runs. When TRIM is used, then those blocks don’t get preserved, so the garbage collection runs more efficiently.

          This has nothing to do with emptying trash. That deletes files, but file deletion simply marks the blocks as free in the OS’s file-system structures. The drive itself doesn’t know anything happened unless the OS explicitly tells it that they are now free – which is what TRIM does (and that’s all TRIM does.) And TRIM does this whenever files are deleted, whether or not a trash/recycler system is used to facilitate that deletion.

          The fact that some manufacturers have critical bugs, where the drive is garbage-collecting the wrong blocks doesn’t change any of this.
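The garbage-collection efficiency argument above can be illustrated with a toy model (a deliberate simplification for illustration, not any real firmware): count how many blocks a drive must copy out of a flash segment before it can erase that segment.

```python
def gc_copy_cost(segment_blocks, deleted_blocks, trim_enabled):
    """Toy model of one garbage-collection pass over a flash segment.

    segment_blocks: set of logical blocks stored in the segment
    deleted_blocks: set of blocks the OS has deleted
    Returns the number of blocks the firmware must relocate before erase.
    """
    if trim_enabled:
        # TRIM told the drive which blocks are garbage: copy only live data.
        return len(segment_blocks - deleted_blocks)
    # Without TRIM, every block looks live until the OS overwrites it.
    return len(segment_blocks)
```

With a 100-block segment of which 40 were deleted, the drive relocates all 100 blocks without TRIM but only 60 with it; the extra copying of undetectable garbage is exactly the inefficiency described above.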

  • syxbit

    Nooooo. I just bought a Samsung 850 Pro. I’ll be on the lookout

  • Mike Blaszczak

    What does “8-series” mean?

    • Adam Surak

      Means that the drives are called Samsung 8-something. Like 840, 850, 843 etc.

  • Steven Howe

    I had that issue with a Crucial SSD 120 GB drive. I finally gave up and tossed it. The speed boost was not worth the rebuild times.

  • Theodore Ts’o

    Nearly all of the ext4 bugs that have been fixed in the last couple of releases are ones which were around for years, and very hard to trigger (in one case we fixed a bug in ext4 that was also in ext3, and had survived multiple rounds of enterprise linux release testing by IBM, HP, Red Hat, and SuSE), or are ones which were introduced during the merge window and for which the fix was quickly caught and pushed out before the kernel was released. The sort of bugs that we worry about are ones where you have blocks that are being written using async I/O and direct I/O racing with the same blocks getting allocated or truncated or punched out using the new punch hole functionality. Most applications don’t have these sorts of extreme workload characteristics; they just append to a file, or rewrite blocks that were already allocated.

    In particular, if you are using a standard 4k block file system, with the default mkfs options, I really wouldn’t worry too much. The bugs tend to happen in the newer, more esoteric file system features, such as inline data, bigalloc file systems (especially if you are doing random sparse writes racing with truncate/punch hole operations), etc. But that’s not how most people use ext4.

    If it will make you feel any better about file system bugs, we run very careful regression testing during the entire file system development process, and if a commit causes a regression, I’ll drop the patch. And the regression tests are generally far more rigorous than typical file system workloads.

    • Adam Surak

      I still have great confidence in the work you do and great respect for it

      • Heinz Kurtz

        Any news?

  • l0rd carlos

    Fuck, I’ve got two Samsung 850 Pros in my Windows desktop, though sometimes I fiddle around with Linux.

    Do I need TRIM if I have overprovisioned / unpartitioned space?

  • Venky Anand

    Not sure if you use the Self Encryption option in the drive. If yes, there is a possibility of TRIM being triggered internally by the controller when you issue cryptographic erase.

  • Rick

    So what consumer level SSDs are worth buying? Intel 730, 520, 320, 1500? Plextor? Anything else?

    • Spammer Rick

      I ordered an Intel 730. From my research the 730, 320, 310 and X-25 have Intel controllers; probably the 9 series too, I don’t know. The data center SSDs have Intel controllers too. Everything else has SandForce controllers, which are much worse according to this (Intel 330): http://www.aerospike.com/docs/operations/plan/ssd/ssd_certification.html

      This 730 that I ordered uses the same controller as the data center SSDs Intel S3500 and Intel S3700. It also supposedly has power-loss protection: https://communities.intel.com/thread/75984 and AES encryption. According to this http://www.anandtech.com/show/7803/intel-ssd-730-480gb-review it seems like the whole drive is an overclocked S3500, so better than it?! Probably not, since it doesn’t seem to have HET NAND (more durable) and uses normal MLC NAND. That said, it’s likely it will be close to the outstanding performance of the S3500, a test of which is here: http://www.aerospike.com/docs/operations/plan/ssd/ssd_certification.html

      I’ll share with you how it performs and if it solves my computer problems.

    • Hugh Briss

      I have yet to have a single problem with the discount brands SSD snobs turn up their noses at: Mushkin, Patriot, Apotop, etc. They may not be the coolest SSDs in existence but they’re plenty fast and don’t seem to fail easily.

  • Pascal

    Why do you alert for a problem with a single machine (and get woken up at night) if it does not impact the service?

    • Adam Surak

      Because it was an error that we have not seen before so it triggered a weird behaviour in the application itself. For most of the issues, we do not page.

  • Laci

    Were the drives using the latest firmware? Samsung has a tool to update it, but it only works on Windows.

    • Adam Surak

      Yes, the firmware was the latest, and to this moment there is nothing newer.

  • David C.

    Sounds like the Linux community needs to do what Apple did. Don’t allow TRIM to be enabled, except on drives that have been explicitly tested and proven compatible. Don’t assume “compatible” if the drive isn’t on either list.

    Of course, since Apple only tests drives that they ship in their computers, this rule effectively means that third-party SSDs have no TRIM support unless you apply a hack. But that’s definitely better than enabling it on an incompatible drive and discovering corrupted files all over the place.

    • Heinz Kurtz

      The Linux world should definitely tone down on the TRIM. There’s too much of it everywhere.

      mkfs shouldn’t trim. If I wanted it to do that, I could always use blkdiscard first.

      • David C.

        I will politely disagree with you. I think using TRIM whenever you erase files or file systems is a good thing. The fact that some manufacturers have implementations that end up freeing the wrong blocks simply means that those drives should not be used, not that there is a problem with TRIM itself.

        • Heinz Kurtz

          Have you ever used photorec? Never? Honestly? Good for you. Good for everyone else? People make mistakes. On an HDD you can revert some of them. On an SSD it’s game over, because SSDs happily throw away gigabytes of data in an eyeblink.

          It may be worth the extra performance and lifetime on a hot database server. For the average Linux user it may be different.

          I’m not asking to remove functionality. I just wish for saner defaults.

          fstrim offers the best of both worlds: TRIM, yes, and no long-term performance degradation, but it’s not instant, so there’s still time to disable the weekly/monthly fstrim cron job after sending the wrong rm command.

          But that only works if other programs stop issuing TRIM without asking you.

          • Hugo Vinícius

            I’ve heard of problems involving the TRIM command, but I never gave them any importance. Now, after reading this article, I’ll be more cautious about it, especially because I have two 850 Pros and one 850 EVO.

            For recovery of accidentally deleted files, I always turn to File History on my Windows 8 machine. Also, all my data is on SSDs, but with periodic manual copies to HDs and my laptop. Since then, I don’t remember losing a file.

            PS: sorry for my English. It isn’t my native language.

          • Dooshe Nozzle

            Your English is quite good. No need to apologize.

  • Uwe Brauer

    JFS?
    I am using a Samsung 840 (non-Pro) and I have not run into any problems. However, I am using jfs; could that make the difference?

  • issafram

    First of all, thank you for this detailed blog entry about your experience.
    Second, I have been eyeing a Samsung 850 PRO 1TB drive for almost a year now. Do you recommend that I stay away from the entire 850 series until Samsung issues a firmware update or what?

  • Hans-Peter Jansen

    What makes this issue even worse is Samsung’s absolutely lousy way of dealing with it. Sure, you’re lucky: they even sent an engineering team for examination, but in the end they’re going to sweep the problem under the carpet as best they can, as usual. No public errata listing, no proactive problem solving, no definitive user advice.

    Maybe, in some months, a revised firmware blob will appear on the support pages, if we, the owners of these buggy devices, are lucky enough to correlate our issues to it. Needless to say, this firmware blob will be packed in a Windows exe, and of course it will be a major hassle to flash it in non-Windows environments.

    Updating drive firmware takes a lot of confidence in the manufacturer. This is something Samsung has to earn from its customers, and something they have missed so far.

    Intel learned it the hard way (the FDIV bug). Before, they thought it was bad for their revenue to spread the word about such problems; now they perform extensive errata management (not that this couldn’t be excelled ;).

    • Heinz Kurtz

      It will do no one any good to rush or panic without any plan. What’s Samsung supposed to do as long as the root cause is unknown and they’re not even able to reproduce the issue (nor is anyone else)? If there is a firmware bug it’s not one that is triggered easily, engineers don’t have to travel for those.

    • Frank Drebin

      At the very least I would expect my money back if they ship a defective product. That’s already the law in most countries.

    • James

      Looks like Samsung dealt with it just fine 🙂

      • Hans-Peter Jansen

        Yes, and I’m glad to hear that.

        OTOH, it’s hard to believe that only CERTAIN Samsung devices are affected but e.g. no Intel ones, given that the failure is located in the md RAID 0/10 DISCARD handling. Something is still fishy here.

  • issafram

    What is the latest update after the Samsung visit?

  • Christian Affolter

    Any news from Samsung already? I’m keenly awaiting your next update, as we also experience similar issues within parts of our own infrastructure, running the same SSDs. Until now, we do not know if it’s the RAID controller or the SSD which is to blame. Thanks for keeping us posted.

    • Heinz Kurtz

      Which raid controller(s) are you using?

      • Christian Affolter

        The affected nodes are using either Adaptec 7805 or 72405

    • Adam Surak

      Just posted an update. No progress so far. The only workaround is to disable any usage of TRIM in the meantime. Do you experience the same issues?

  • http://gathman.org/vitae CustomDesigned

    Some practical end-user tools:
    rpm -Va | grep ‘^..5’ # system files with md5 not matching RPM database. Only config files should be changed.

    rsync -ravnc /home/ /my/home/backup # compares md5 of user files to last backup. Look for unexpected mismatches.

    It would probably be worth writing a utility that looks for aligned blocks of 512 zeros, and then compares md5 with a backup, or uses rpm -V
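The backup-comparison half of such a utility could look like this; a minimal sketch, assuming the backup mirrors the live tree’s directory layout (the roots passed in are illustrative):

```python
import hashlib
import os

def md5_of(path):
    """MD5 of a file, read in chunks so large files don't fill memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def mismatched_against_backup(live_root, backup_root):
    """Walk the live tree; report files whose MD5 differs from the backup copy.

    Files with no counterpart under backup_root are skipped.
    """
    mismatches = []
    for dirpath, _dirs, files in os.walk(live_root):
        for name in files:
            live = os.path.join(dirpath, name)
            rel = os.path.relpath(live, live_root)
            backup = os.path.join(backup_root, rel)
            if os.path.isfile(backup) and md5_of(live) != md5_of(backup):
                mismatches.append(rel)
    return mismatches
```

Combined with a scan for aligned 512-byte zero blocks, this narrows the suspect list to files that both look zeroed and disagree with their backup copy.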

  • Marek

    Hi, just FYI: I have been using TRIM with a Samsung 850 PRO 512GB on OS X Yosemite for a couple of weeks; no problems so far.

    • Mancave001

      Me too. Except mine’s the Evo.

  • Dexter

    I am worried Samsung disappeared them. Are you guys OK?

    • Heinz Kurtz

      They probably got abducted by aliens and are still in the middle of anal probing.

      • Adam Surak

        Still alive, not aware of any probes, but even to Neo it looked like a dream 😉

  • l0rd carlos

    Any updates on this?

    • Adam Surak

      Just posted, sorry for the delay.

  • Stanislav German-Evtushenko

    Does any software on top of mdraid use O_DIRECT?
    https://bugzilla.kernel.org/show_bug.cgi?id=99171

    • Heinz Kurtz

      Interesting. After running ./a.out /dev/mdX and running a sync_action
      check afterwards, mismatch_cnt = 128. But I could not reproduce it with HDD
      storage (mismatch_cnt = 0). md *does* behave oddly with loop/tmpfs devices; not sure if that’s the fault of md, loop, or tmpfs.

      • Stanislav German-Evtushenko

        It happens for any block device. You probably should adjust delay and counts to reproduce it.

        • Heinz Kurtz

          Can’t reproduce it; but how often does a piece of software write, and have another thread race-modify the data it is writing? Even if the RAID does not go out of sync, isn’t what you have written garbage in any case? I think there is something similar in the RAID documentation, where it may go out of sync when swapping (when the swap decides it’s no longer needed mid-write).

          Just trying to understand your findings and whether it would affect me in practice 🙂

          • Stanislav German-Evtushenko

            I am able to reproduce it on any hardware with Ubuntu 14.04 and the latest updates.

            There is a case where we do not care if some data was in flight while part of it was changed. In that special case the only important thing is to get back the same data from the block device subsystem that you put there before. And I know at least two examples where this applies.

            The first example is a swap file. If you have a VM which uses O_DIRECT on top of DRBD or MDRAID and this VM actively uses swap (valid for both Windows and Linux guests), then these DRBD and MDRAID arrays will soon become inconsistent.

            Another example is a cache file in some applications. I hit this issue using Graylog2 inside a VM on top of DRBD. I have not done much investigation, but my guess is that some components of Graylog2 use a cache file opened with the O_DIRECT flag.

          • Heinz Kurtz

            See man md, /unexpected mismatch – it seems to be documented. Data that changed in mid-write was invalid to begin with and will never be read back, that’s the assumption the RAID system seems to make in this particular case.

          • Stanislav German-Evtushenko

            This is it. That’s why it does not happen when the host cache is involved, and only happens when data comes to the block device directly from userspace, where it can be changed at any moment; we can’t know in advance whether any application is going to behave like this. In the case of a hypervisor this behaviour can usually be avoided, as we can configure it not to use O_DIRECT.

    • Adam Surak

      No, we do not use O_DIRECT in our software. But maybe fstrim?

  • issafram

    Hey @adamsurak:disqus, people are really hoping you could give an update. Even if Samsung is still looking into it. Any sort of word from you on the latest would be helpful. From my perspective, Amazon’s Prime Day is coming up. I was hoping to buy this solid state drive this coming week. But this blog entry has caused me to hesitate. I think others can relate.

    • Adam Surak

      I have just posted an update. Sorry for the delay.

  • Anders Aagaard

    Checking in on this every now and again. I really appreciate you guys trying to get to the root of the problem and updating the blog along the way.

  • Allan

    So does this change your mind at all on Samsung’s level of cooperation?

    • Heinz Kurtz

      Looking forward to the kernel patch, that should give us the whole picture then. 🙂

      • leexgx

        They really need to push out the queued-TRIM firmware fix (disabling the advertised support for it). For Samsung it would be better to simply stop advertising features they don’t support correctly, which results in errors. (Crucial have issued fixes for most of their current SSDs; the older M550 still has the queued-TRIM bug, and it doesn’t seem they are going to issue a fix for that older drive.)

  • Gerhard Islinger

    While all seems to center on Samsung and it being a Linux bug… in that case I find it puzzling that it works with Intel SSDs…? Could anyone test it on Intel?

    Or am i missing something?

    • Heinz Kurtz

      We are still missing the patch itself, it should solve the mystery. Bugs can be weird, and trigger only under obscure circumstances, if it happens only under specific loads it could be a timing issue or whatever.

  • http://gustik.eu/ Lars Schotte

    First things first: when I read “Samsung engineers”, I ask myself if that is meant seriously. Believing in Samsung is like believing in Santa Claus. So I would be interested whether these problems have also arisen on other operating systems like FreeBSD, because I do not believe that different TRIM-triggering implementations would have the same problem. A bug in the Linux kernel is a good theory, but it happens much more often that Samsung messes things up and then is not ready to stand behind it. I am using Kingston and Crucial SSDs and have never had any problems yet. I have to state clearly that I do not believe in Samsung at all. Why are the Intel SSDs working? Care to investigate that?

    • 7eggert

      Believing in Samsung Engineers is a sane thing, things don’t work that bad by accident.

      The hardware is usually good, but the software is sometimes non-usable.

  • SSD-User

    So, it’s July 20th, but I haven’t read anything about the release of Samsung’s information. Does anybody know more?

  • john Doe

    So any updates on the Samsung Patch ???

    • 7eggert

      A blog referred me to:
      https://bugs.launchpad.net/ubuntu/+source/fstrim/+bug/1449005
      (queued TRIM commands don’t work in spite of being announced by the firmware).

      This may or may not be connected.

      • SSD-User

        It is not connected, and Fefe should know this. 😉
        See the June-16-update of this blog-article.

    • Heinz Kurtz
      • Allan

        I believe Adam should post a follow-up given the stir this blog created in the community and the negative press it gave to Samsung. Does this patch work with their software to eliminate the issue using their tests?

      • 7eggert

        Also, I’m not sure what the deal is with the blacklist. The affected disks were not blacklisted in the initially used kernel, but I don’t see whether blacklisting the affected disks for queued TRIM was tried, and whether that helped.

        (I guess the SSDs are configured to use the 512-byte-block interface.)

  • http://www.appwench.com Neirinck Mike

    Dear Sir, in your article you state that all the damage was caused by a bug in the Linux kernel, and moreover that any SSD could get corrupted. May I ask why Intel showed no data loss, contrary to Samsung? Yours sincerely

  • Santiago Franco

    Hi, I just bought a new Samsung SSD 850 EVO (not Pro) and installed Ubuntu 15.10. After reading this (I had some strange index errors after a couple of days when everything had been fine), I disabled TRIM and now it works (though login takes approx. 5 minutes, which should be much faster with an SSD); at least there are no error messages on booting. Does anybody know if this is fixed, and if not, will the SSD drive be OK without TRIM, or will performance degrade with use? Thanks!

  • Rimas Kalpokas

    So, did the kernel patch find its way into the Ubuntu 14.04 kernel? I tried to find out, but sadly found no information on this.

  • Mark

    Is the TRIM command effective enough for the average user? As I understood it, it’s more about security standards than real security issues, am I right? I also saw an article from a recovery team saying SSDs are difficult to recover data from (no matter whether it’s the data owner or a thief): https://hetmanrecovery.com/recovery_news/securely-destroying-information-the-issue-of-solid-state-drives.htm

  • diabolik62

    Hello,
    I searched the net but could not find anything. I have a Samsung SSD with a fresh Arch Linux installation, and it generated this fstab:
    [eva@ectorpc ~]$ sudo cat /etc/fstab
    [sudo] password di eva:
    #
    # /etc/fstab: static file system information
    #
    #
    # /dev/sda4
    UUID=81d5807b-de09-4be6-9759-56fb595f954e / ext4 rw,relatime,stripe=8191,data=ordered 0 1

    # /dev/sda3
    UUID=9AC7-6A17 /boot vfat rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro 0 2

    [eva@ectorpc ~]$
    my lsblk
    [eva@ectorpc ~]$ lsblk
    NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
    sda 8:0 0 232,9G 0 disk
    ├─sda1 8:1 0 1G 0 part
    ├─sda2 8:2 0 109,9G 0 part
    ├─sda3 8:3 0 1003M 0 part /boot
    └─sda4 8:4 0 13G 0 part /
    sdb 8:16 0 698,7G 0 disk
    ├─sdb1 8:17 0 400M 0 part
    ├─sdb2 8:18 0 260M 0 part
    ├─sdb3 8:19 0 128M 0 part
    ├─sdb4 8:20 0 606,6G 0 part
    ├─sdb5 8:21 0 72G 0 part
    ├─sdb6 8:22 0 854M 0 part
    ├─sdb7 8:23 0 486M 0 part
    ├─sdb8 8:24 0 450M 0 part
    └─sdb9 8:25 0 17,5G 0 part
    sr0 11:0 1 1024M 0 rom
    [eva@ectorpc ~]$
    I read the Arch wiki, but I cannot understand what I must do in the fstab. Can someone experienced tell me what I should add to the fstab, or can I leave the generated one as-is?

    I also enabled fstrim.service

    [eva@ectorpc ~]$ systemctl status fstrim.timer
    ● fstrim.timer – Discard unused blocks once a week
    Loaded: loaded (/usr/lib/systemd/system/fstrim.timer; enabled; vendor preset: enabled)
    Active: active (waiting) since Wed 2016-10-19 14:45:05 CEST; 7min ago
    Docs: man:fstrim

    Oct 19 14:45:05 ectorpc systemd[1]: Started Discard unused blocks once a week.

    [1]+ Stopped systemctl status fstrim.timer
    Should I disable it or leave it on?
    Thanks for the help

  • diabolik62

    Sorry, my SSD is external:
    [eva@ectorpc ~]$ sudo hdparm -I /dev/sda
    [sudo] password di eva:

    /dev/sda:

    ATA device, with non-removable media
    Model Number: Samsung Portable SSD T1 250GB
    Serial Number: S25JNAAGB10322Y
    Firmware Revision: EMT41P6Q
    Transport: Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
    Standards:
    Used: unknown (minor revision code 0x0039)
    Supported: 9 8 7 6 5
    Likely used: 9
    Configuration:
    Logical max current
    cylinders 16383 16383
    heads 16 16
    sectors/track 63 63

    CHS current addressable sectors: 16514064
    LBA user addressable sectors: 268435455
    LBA48 user addressable sectors: 488397168
    Logical Sector size: 512 bytes
    Physical Sector size: 512 bytes
    Logical Sector-0 offset: 0 bytes
    device size with M = 1024*1024: 238475 MBytes
    device size with M = 1000*1000: 250059 MBytes (250 GB)
    cache/buffer size = unknown
    Nominal Media Rotation Rate: Solid State Device
    Capabilities:
    LBA, IORDY(can be disabled)
    Queue depth: 32
    Standby timer values: spec’d by Standard, no device specific minimum
    R/W multiple sector transfer: Max = 1 Current = 1
    DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
    Cycle time: min=120ns recommended=120ns
    PIO: pio0 pio1 pio2 pio3 pio4
    Cycle time: no flow control=120ns IORDY flow control=120ns
    Commands/features:
    Enabled Supported:
    * SMART feature set
    * Power Management feature set
    * Write cache
    * Look-ahead
    * WRITE_BUFFER command
    * READ_BUFFER command
    * NOP cmd
    * DOWNLOAD_MICROCODE
    SET_MAX security extension
    * 48-bit Address feature set
    * Mandatory FLUSH_CACHE
    * FLUSH_CACHE_EXT
    * SMART error logging
    * SMART self-test
    * General Purpose Logging feature set
    * WRITE_{DMA|MULTIPLE}_FUA_EXT
    * 64-bit World wide name
    Write-Read-Verify feature set
    * WRITE_UNCORRECTABLE_EXT command
    * {READ,WRITE}_DMA_EXT_GPL commands
    * Segmented DOWNLOAD_MICROCODE
    * Gen1 signaling speed (1.5Gb/s)
    * Gen2 signaling speed (3.0Gb/s)
    * Gen3 signaling speed (6.0Gb/s)
    * Native Command Queueing (NCQ)
    * Phy event counters
    * READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
    DMA Setup Auto-Activate optimization
    * Asynchronous notification (eg. media change)
    * Software settings preservation
    * reserved 69[4]
    * DOWNLOAD MICROCODE DMA command
    * SET MAX SETPASSWORD/UNLOCK DMA commands
    * WRITE BUFFER DMA command
    * READ BUFFER DMA command
    * Data Set Management TRIM supported (limit 8 blocks)
    Security:
    supported
    not enabled
    not locked
    not frozen
    not expired: security count
    supported: enhanced erase
    2min for SECURITY ERASE UNIT. 8min for ENHANCED SECURITY ERASE UNIT.
    Logical Unit WWN Device Identifier: 5002538d00000000
    NAA : 5
    IEEE OUI : 002538
    Unique ID : d00000000
    Checksum: correct
    [eva@ectorpc ~]$