Dear Analyst #75: How to extract first, last, and multiple middle names into 3 separate columns
In episode 52, I talked about how to extract text from a cell. You might use this formula to extract a certain value from a cell when your database or CSV export contains a bunch of miscellaneous prefix data. This formula relies on finding a certain character in the cell and then using the MID formula to get the data you need. Turns out this formula doesn’t solve all use cases. In particular, when it comes to getting first names, middle name(s), and last names (surnames) from a cell. Evert Scholtz left the following comment on episode 52’s YouTube tutorial: You’re in luck Evert! While the formula discussed in this post won’t account for salutations, the solution should get you on your way to getting salutations to work too. This formula trick allows you to extract the first name, middle name(s), and last name from a cell where there are multiple spaces in between names. The reason I like the formula (in particular for the middle names) is because it builds on the formula from episode 52, the previous episode on counting the number of words in a cell, and the idea that formulas are composable (also discussed in the previous episode). Link to the Google Sheet for this episode is here. Watch a tutorial of this episode: Formulas for first name, middle name(s) and last name If you just want to see the formulas, take a look below. The formulas assume your data starts in cell A2 with the output looking like this (make a copy of the Google Sheet to use formulas directly): Formula to extract the first name =LEFT(A2,FIND(" ",A2)-1) Formula to extract middle name(s) =IF(LEN(A2)-LEN(SUBSTITUTE(A2," ",""))=1,"",MID(A2,FIND(" ",A2)+1,FIND("#",SUBSTITUTE(A2," ","#",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))))-FIND(" ",A2)-1)) Formula to extract last name =RIGHT(A2,LEN(A2)-FIND("#",SUBSTITUTE(A2," ","#",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))))) As with all my posts, I like to dig into why and how these formulas work, so read on if you want to to see the step-by-step tutorial. The second sheet on the Google Sheet also contains the intermediate steps to get to these formulas so you can see formula composability at its finest. Creating intermediate columns to test your formulas I didn’t experiment in one cell to get any of the above formulas (especially the formula for extracting middle names). A common practice is to create what I call “intermediate” columns to test out the final formula you want to use. Columns in Google Sheets and Excel are “cheap.” They cost nothing to create, and give your colleague or end user the ability to see how you came to your solution. Typically you would hide these intermediate columns from the final output so that it doesn’t clutter your report, analysis or dashboard. In your final output, you might delete these intermediate columns and incorporate the formulas into the “final formula” (like what’s on this sheet). This makes things look really clean and tidy. If I were building something for my team or colleague, I would ultimately hide these columns from my final output so that they can be unhidden in case someone needs to check my work: 1) Get number of spaces in name The first intermediate column is “Number of spaces.” As I mentioned earlier, this is pretty much the same formula discussed in episode 74. Instead of counting the number of words in the cell, we actually just want to count the number of spaces in between the words. Listen to episode 74 to see the full explanation. The formula takes the length of the name and subtracts the length of the name without the spaces to get the actual count of spaces in between the words: The reason we need to know the number of spaces in the name will become more apparent later. 2) Find position of the first space The second intermediate column is column C. We need a formula to get the position of the first space in the name. If there are no middle names, this number should just be the position of the one and only space. The reason we need to get this number is two-fold: Helps us isolate the first nameHelps us determine the beginning of all the middle name(s) in the name It’s a simple FIND formula which returns the location of the first space the formula comes across. For the name “Phil Banks,” the position of the first space is 5: Now that we have the position of the first space all we need is the position of the last space to find the “end” of all the middle name(s). This is where the formula gets a little more challenging because there’s no way to isolate the position of that last space like we can with the first space. You would think that with the number of spaces from step 1, we could just “skip” to the last space and find the position. Unfortunately I don’t think a solution exists to “skip” to that last space. 3) Replace the last space with a random character A workaround is to use the SUBSTITUTE formula in an unconventional way. You generally use the SUBSTITUTE formula when you want to substitute everything in the text with something else. In Step 1, for instance, we place all the spaces with an empty string (“”) to get the number of spaces in the text. The SUBSTITUTE formula has a 4th optional argument that lets you replace a specific occurrence of a character. For instance, if I have the name “Tova Ionas Beirne” and I use the SUBSTITUTE formula to substitute the 2nd space for the string “SheetsRock!!!” the result would be “Tova IonasSheetsRock!!!Beirne.” So the strategy here is to replace the last space with some random character (I use the “#” symbol) and we’ll FIND that character later. Since I have the number of spaces from Step 1, that number acts as the 4th argument (last space) to put into the SUBSTITUTE formula: The result is all of the names in our list with the last space in the name being substituted with a pound/hashtag character “#.” We now have almost all of our intermediate columns to build the actual formulas to pull the first name, middle name(s), and last name from our list of names. 4) Find position of last space The hard work of putting a random character to show up in the last space of each name is done. Now we can do another simple FIND to get the position of that special character (“#”) and we’ll have the position of the last space in the name: We now have all the intermediate columns to get the first name, middle name(s), and last name into their own separate columns. If we wanted to make this a little neater, we could combine the formulas in columns D and E so that we only see numbers in our intermediate columns (or just simply hide column D). Get first name from cell The first name is probably the easiest formula to understand since it uses the familiar LEFT formula. We only need two inputs: the cell with the name and the number of characters to return starting from the “left” of the cell. Since we want to take everything up to the first space in the name, we can use the number in the “First space position” column to get that number of characters to pull out from the full name: Notice how there is a “-1” after we reference the position of the first space. We need this “-1” since the position of the first space is 1 character more than what we need to pull out of the full name. Now onto the middle name. Get multiple middle names from full name in a cell This is the trickiest formula to write if you don’t have the intermediate columns from steps 1-4 above. We use all 3 of the intermediate columns that calculate spaces (column D is hidden which replaced the last space in the name with a special character). The formula takes all the character in between the first and last space in the full name: Let’s focus on the MID formula for now. The MID formula takes three inputs: The cell you want to pull characters fromThe “start” position in the cell where you want to pull charactersThe number of characters to pull from that start position Looking at our formula, the three inputs work as follows: A5: The cell with our full name “Daniyal Senn McReynolds” in the screenshot ab oveC5 + 1: The “start” is the position of the first space. We want everything after “Daniyal” so that’s why we reference C5. Notice the “+1” at the end to make sure we pull after the first space (in the example above, this would be position 9). E5 – C5 – 1: The number of characters we want to pull is position of the last space minus the position of the first space. We need to subtract 1 to account for start of the first space. This might be best shown in this screenshot below: The only reason we need that IF statement surrounding the MID formula is to account for names that don’t have any middle names. If there’s only one space in the name, that means there’s only a first name and last name. Hence the “” if the number of spaces is 1. Get last name from full name in a cell The last name is easy now that we know the position of the last space in the name. Similar to the first name, we take the opposite of LEFT and use RIGHT to get a certain number of characters starting from the right of the text. We subtract the position of the last space from the length of the entire name to get the number of characters: Technically the LEN function could be its own intermediate column. Since the formula is so short I just wrote the LEN formula into the main formula for the last name. In the screenshot above, the length of the name “Daniyal Senn McReynolds” is 23. If we subtract the position of the last space in the name (13), we get 10 characters we need to pull from the right of the name to get “McReynolds.” Epilogue: accounting for salutations and suffixes in a name The formulas in this episode don’t account for names where you might have “Mr.” before the name or “Jr.” after the name. A few additional “intermediate” columns you may need to add to account for a bulletproof system for extracting the correct first name, middle name(s), and last name: A list of all salutations and suffixes in a separate sheet to act as a lookupUtilize the SUBSTITUTE function with the third optional parameter to replace the “correct” first space and last space in the name (e.g. if the name is “Mr. John Quincy Adams,” the position of the first space in the name would be 9).For both first name and last names, you’ll have to use the MID function instead of LEFT or RIGHT to account for the salutations and suffixes Hopefully the formulas in this episode will sufficient for your data cleaning needs if you have full names that don’t have salutations and suffixes! Other Podcasts & Blog Posts In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting: All The Smoke #92: Ernie Johnson The post Dear Analyst #75: How to extract first, last, and multiple middle names into 3 separate columns appeared first on .