Select capture group only


I’m parsing a bunch of HTML files where I have to extract some contents. Let’s say I have the following:

<div class="someclass">some content</div>

I would like to extract some content. I’m currently targeting the whole line with a regex:


and then removing the markup. Is there any way to make a regex like:


And then copy only the results of the capture group, and not the entire search result?

Thanks in advance for any help!


You would want to use a lookbehind assertion, but Atom doesn’t support them.
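
For reference, here is a rough sketch of what the lookbehind approach looks like in an engine that does support it. It uses Python's `re` module rather than Atom, and the sample string is just the line from the question; treat it as an illustration, not something you can paste into the find panel.

```python
import re

# Sample line taken from the question above.
html = '<div class="someclass">some content</div>'

# The lookbehind (?<=...) and lookahead (?=...) assert that the tags are
# present without making them part of the match, so the whole match is
# already just the inner text.
pattern = re.compile(r'(?<=<div class="someclass">).*?(?=</div>)')

match = pattern.search(html)
inner = match.group(0) if match else None
print(inner)
```

Note that Python only accepts fixed-width lookbehinds, which works here because the opening tag is a literal string.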

What are you doing with the content following extraction?


Thanks for the answer. I’m basically scraping data from websites, but the data is too irregular to set up anything fully automated; it’s typically a couple of hundred results per page, each time with a different path.

For more regular stuff I was using a tool that lets you target the data by XPath, but I assume there’s no way to do that here either?

Is there any other way to accomplish this that I’m missing?


Yes, of course. The find-and-replace package isn’t the most feature-rich find in the world, but you can do anything you can conceive of via the API.

In this case, I think you might be satisfied with just performing a find for something like \<div class=\"someclass\"\>(.*)\<\/div\> and replace it with $1, which will strip out the tags.
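
If you end up scripting this outside the editor, the same idea can be sketched in Python. This is only an assumption about how you might do it, not an Atom feature: `\1` plays the role of `$1`, and `findall` returns just the capture group, which is closer to what you originally asked for.

```python
import re

# Sample line taken from the question above.
html = '<div class="someclass">some content</div>'

# Same pattern as the find expression, with a non-greedy group so that
# several divs on one line would not be merged into a single match.
pattern = re.compile(r'<div class="someclass">(.*?)</div>')

# Equivalent of "replace with $1": the tags go away, the captured text stays.
stripped = pattern.sub(r'\1', html)

# Read the capture group directly, leaving the source text untouched.
captured = pattern.findall(html)

print(stripped)
print(captured)
```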


Yes, that works of course, but it would leave the rest of the markup in place. Let’s say this is the whole page:

		<div class="someclass">content 1</div>
		<img src="image1.jpg">
		<div class="someclass">content 1</div>
		<img src="image1.jpg">
		<div class="someclass">content 1</div>
		<img src="image1.jpg">

What I’m looking for is a way to retrieve only the “content N” parts. This search and replace would leave the rest of the DOM in place.


In that specific case, you could just do what I suggested before, to strip the <div> tags, then use \s*\<.*\>\r\n and replace it with nothing to strip all the rest of the tags.
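
To make the two passes concrete, here is a minimal Python sketch of the same idea. The `page` string is a made-up sample modelled on your example, and the second pattern is slightly different from the one above: in Python `\s` also matches line breaks, so `\s*` would swallow the newline of the previous line and join the results together. The sketch uses `^[ \t]*` with `re.MULTILINE` instead, and `\r?\n` so it works with either line ending.

```python
import re

# Hypothetical sample page, modelled on the example in the thread.
page = (
    '\t\t<div class="someclass">content 1</div>\r\n'
    '\t\t<img src="image1.jpg">\r\n'
    '\t\t<div class="someclass">content 2</div>\r\n'
    '\t\t<img src="image2.jpg">\r\n'
)

# Pass 1: unwrap the target divs, keeping only the captured text.
unwrapped = re.sub(r'<div class="someclass">(.*)</div>', r'\1', page)

# Pass 2: delete every remaining line that consists of a tag.
# [ \t]* only matches indentation, so the previous line break is preserved.
cleaned = re.sub(r'^[ \t]*<.*>\r?\n', '', unwrapped, flags=re.MULTILINE)

# Trim the leftover indentation from each surviving line.
lines = [line.strip() for line in cleaned.splitlines()]
print(lines)
```

This only holds up when every tag you want to drop sits on its own line, as in your sample; anything with nested or multi-line markup would need a real HTML parser.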


That’s great, thank you!