Select capture group only


#1

I’m parsing a bunch of HTML files where I have to extract some contents. Let’s say I have the following:

<div class="someclass">some content</div>

I would like to extract just the text "some content". I’m currently targeting the whole line with a regex:

someclass">.*</div>

and then removing the markup. Is there any way to make a regex like:

someclass">(.*)</div>

And then copy only the results of the capture group, and not the entire search result?

Thanks in advance for any help!


#2

You would want to use lookbehind assertions, but Atom doesn’t support them.
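
For comparison, here is a minimal sketch of what a lookbehind would do in an engine that has one. It uses Python's re module rather than Atom, and the variable names are only illustrative. The lookbehind and lookahead check the surrounding markup without including it, so the whole match is already the wanted text.

```python
import re

html = '<div class="someclass">some content</div>'

# (?<=...) must be preceded by the opening markup, (?=...) must be
# followed by the closing tag; neither becomes part of the match.
pattern = re.compile(r'(?<=someclass">).*?(?=</div>)')

match = pattern.search(html)
extracted = match.group(0) if match else None
print(extracted)  # some content
```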

What are you doing with the content following extraction?


#3

Thanks for the answer. I’m basically scraping data from websites, but the data is too irregular to set up anything fully automated; it’s typically a couple of hundred results per page, each time with a different path.

For more regular stuff I was using import.io, which lets you target the data by XPath, but I assume there’s no way to do that either?

Is there any other way to accomplish this that I’m missing?


#4

Yes, of course. The find-and-replace package isn’t the most feature-rich find in the world, but you can do anything you can conceive of via the API.

In this case, I think you might be satisfied with just performing a find for something like \<div class=\"someclass\"\>(.*)\<\/div\> and replace it with $1, which will strip out the tags.
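
The same replace-with-the-group idea can be sketched outside the editor. This is an illustration in Python's re module, not Atom's API; Python writes the group reference as \1 where the find-and-replace field uses $1, and the lazy .*? is my own choice so that two divs on one line would not be merged into a single match.

```python
import re

line = '<div class="someclass">some content</div>'

# The whole match (tags included) is replaced by capture group 1,
# which leaves only the text that sat between the tags.
pattern = re.compile(r'<div class="someclass">(.*?)</div>')

stripped = pattern.sub(r'\1', line)
print(stripped)  # some content
```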


#5

Yes, that works, of course, but it would leave the rest of the markup in place. Let’s say this is the whole page:

<div>
	<article>
		<div class="someclass">content 1</div>
		<img src="image1.jpg">
	</article>
	<article>
		<div class="someclass">content 2</div>
		<img src="image2.jpg">
	</article>
	<article>
		<div class="someclass">content 3</div>
		<img src="image3.jpg">
	</article>
</div>

What I’m looking for is a way to retrieve only the “content N” parts. This search and replace would leave the rest of the DOM in place.


#6

In that specific case, you could do what I suggested before to strip the <div> tags, then search for \s*\<.*\>\r\n and replace it with nothing to strip all the rest of the tags. (If the files use Unix line endings, end the pattern with \r?\n instead of \r\n.)
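
As a rough sketch of those two passes, here they are in Python's re module rather than in Atom. The sample page and variable names are made up for illustration. One deliberate difference: the second pattern uses ^[ \t]* instead of \s*, because in Python \s also matches newlines and would glue the surviving lines together. For comparison, the last step collects the capture groups directly with re.findall, which is the "capture group only" result asked for at the top of the thread.

```python
import re

# Hypothetical sample page, modelled on the one posted above.
page = (
    '<div>\n'
    '\t<article>\n'
    '\t\t<div class="someclass">content 1</div>\n'
    '\t\t<img src="image1.jpg">\n'
    '\t</article>\n'
    '\t<article>\n'
    '\t\t<div class="someclass">content 2</div>\n'
    '\t\t<img src="image2.jpg">\n'
    '\t</article>\n'
    '</div>\n'
)

# Pass 1: replace each matching <div> with its capture group.
step1 = re.sub(r'<div class="someclass">(.*)</div>', r'\1', page)

# Pass 2: delete every line that consists only of a tag.
# \r?\n accepts both Windows and Unix line endings.
step2 = re.sub(r'^[ \t]*<.*>\r?\n', '', step1, flags=re.MULTILINE)

remaining = [line.strip() for line in step2.splitlines()]
print(remaining)

# Alternative: skip the replacing and collect the groups directly.
captured = re.findall(r'<div class="someclass">(.*?)</div>', page)
print(captured)
```

Both lists should hold the same two strings. The findall route avoids editing the page at all, which may be simpler when the goal is only to copy the content out.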


#7

That’s great, thank you!